Preface


It has been almost 10 years since the first edition of Econometric Analysis of Cross 
Section and Panel Data was published. The reaction to the first edition was more 
positive than I could have imagined when I began thinking about the project in the 
mid-1990s. Of course, as several of you have kindly and constructively pointed out— 
and as was evident to me the first time I taught out of the book—the first edition was 
hardly perfect. Issues of organization and gaps in coverage were shortcomings that 
I wanted to address in a second edition from early on. Plus, there have been some 
important developments in econometrics that can and should be taught to graduate 
students in economics. 

I doubt this second edition is perfect, either. But I believe it improves the first edi- 
tion in substantive ways. The structure of this edition is similar to the first edition, but 
I have made some changes that will contribute to the reader’s understanding of sev- 
eral topics. For example, Chapter 11, which covers more advanced topics in linear 
panel data models, has been rearranged to progress more naturally through situations
where instrumental variables are needed in conjunction with methods for accounting
for unobserved heterogeneity. Data problems—including censoring, sample
selection, attrition, and stratified sampling—are now postponed until Chapters 19 
and 20, after popular nonlinear models are presented under random sampling. I think 
this change will further emphasize a point I tried to make in the first edition: It is 
critical to distinguish between specifying a population model on the one hand and 
the method used to sample the data on the other. As an example, consider the Tobit 
model. In the first edition, I presented the Tobit model as applying to two separate 
cases: (1) a response variable is a corner solution outcome in the population (with the 
corner usually at zero) and (2) the underlying variable in the population is con- 
tinuously distributed but the data collection scheme involves censoring the response 
in some way. Many readers commented that they were happy I made this distinction, 
because empirical researchers often seemed to confuse a corner solution due to eco- 
nomic behavior and a corner that is arbitrarily created by a data censoring mecha- 
nism. Nevertheless, I still found that beginners did not always fully appreciate the 
difference, and poor practice in interpreting estimates lingered. Plus, combining the 
two types of applications of so-called “censored regression models” gave short shrift
to true data censoring. In this edition, models for corner solutions in the population 
are treated in Chapter 17, and a variety of data censoring schemes are covered in 
more detail in Chapter 19. 

As in the first edition, I use the approach of specifying a population model and 
imposing assumptions on that model. Until Chapter 19, random sampling is assumed 
to generate the data. Unlike traditional treatments of, say, the linear regression 
model, my approach forces the student to specify the population of interest, propose 
a model and assumptions in the population, and then worry about data issues. The
last part is easy under random sampling, and so students can focus on various models 
that are used for populations with different features. The students gain a clear un- 
derstanding that, under random sampling, our ability to identify parameters (and 
other quantities of interest) is a product of our assumed model in the population. 
Later it becomes clear that sampling schemes that depart from random sampling can 
introduce complications for learning about the underlying population. 

The second edition continues to omit some important topics not covered in the first 
edition. The leading ones are simulation methods of estimation and semiparametric/ 
nonparametric estimation. The book by Cameron and Trivedi (2005) does an admi- 
rable job providing accessible introductions to these topics. 

I have added several new problems to each of the chapters. As in the first edition, 
the problems are a blend of methodological questions—some of which lead to 
tweaking existing methods in useful directions—and empirical work. Several data 
sets have been added to further illustrate how more advanced methods can be applied.
The data sets can be accessed by visiting links at the MIT Press website for the book: 
http://mitpress.mit.edu/9780262232586. 


New to the Second Edition 


Earlier I mentioned that I have reorganized some of the material from the first edi- 
tion. I have also added new material, and expanded on some of the existing topics. 
For example, Chapter 6 (in Part II) introduces control function methods in the con- 
text of models linear in parameters, including random coefficient models, and dis- 
cusses when the method is the same as two-stage least squares and when it differs. 
Control function methods can be used for certain systems of equations (Chapter 9) 
and are used regularly for nonlinear models to deal with endogenous explanatory 
variables, or heterogeneity, or both (Part IV). The control function method is con- 
venient for testing whether certain variables are endogenous, and more tests are 
included throughout the book. (Examples include Chapter 15 for binary response 
models and Chapter 18 for count data.) Chapter 6 also contains a more detailed dis- 
cussion of difference-in-differences methods for independently pooled cross sections. 
Chapter 7 now introduces all of the different concepts of exogeneity of the ex- 
planatory variables in the context of panel data models, without explicitly introduc- 
ing unobserved heterogeneity. This chapter also contains a detailed discussion of the 
properties of generalized least squares when an incorrect variance-covariance struc- 
ture is imposed. This general discussion is applied in Chapter 10 to models that 
nominally impose a random effects structure on the variance-covariance matrix. 
In this edition, Chapter 8 explicitly introduces and analyzes the so-called “generalized
instrumental variables” (GIV) estimator. This estimator, used implicitly in
parts of the first edition, is important for discussing efficient estimation. Further, 
some of the instrumental variables estimators used for panel data models in Chapter 
11 are GIV estimators. It is helpful for the reader to understand the general idea 
underlying GIV, and to see its application to classes of important models. 

Chapter 10, while focusing on traditional estimation methods for unobserved 
effects panel data models, demonstrates more clearly the relationships among ran- 
dom effects, fixed effects, and “correlated random effects” (CRE) models. While the 
first edition used the CRE approach often—especially for nonlinear models—I never 
used the phrase “correlated random effects,” which I got from Cameron and Trivedi
(2005). Chapter 10 also provides a detailed treatment of the Hausman test for com- 
paring the random and fixed effects estimators, including demonstrating that the tra- 
ditional way of counting degrees of freedom when aggregate time effects are included 
is wrong and can be very misleading. The important topic of approximating the bias 
from fixed effects estimation and first differencing estimation, as a function of the 
number of available time periods, is also fleshed out. 

Of the eight chapters in Part II, Chapter 11 has been changed the most. The ran- 
dom effects and fixed effects instrumental variables estimators are introduced and 
studied in some detail. These estimators form the basis for estimation of panel data 
models with heterogeneity and endogeneity, such as simultaneous equations models 
or models with measurement error, as well as models with additional orthogonality 
restrictions, such as Hausman and Taylor models. The method of first differencing 
followed by instrumental variables is also given separate treatment. This widely 
adopted approach can be used to estimate static models with endogeneity and dy- 
namic models, such as those studied by Arellano and Bond (1991). The Arellano and 
Bond approach, along with several extensions, is now discussed in Section 11.6.
Section 11.7 extends the treatment of models with individual-specific slopes, includ- 
ing an analysis of when traditional estimators are consistent for the population 
averaged effect, and new tests for individual-specific slopes. 

As in the first edition, Part III of the book is the most technical, and covers general 
approaches to estimation. Chapter 12 contains several important additions. There is 
a new discussion concerning inference when the first-step estimation of a two-step 
procedure is ignored. Resampling schemes, such as the bootstrap, are discussed in 
more detail, including how one uses the bootstrap in microeconometric applications
with a large cross section and relatively few time periods. The most substantive 
additions are in Sections 12.9 and 12.10, which cover multivariate nonlinear least 
squares and quantile methods, respectively. An important feature of Section 12.9 is 
that I make a simple link between multivariate weighted nonlinear least squares—an
estimation method familiar to economists—and the generalized estimating equations 
(GEE) approach. In effect, these approaches are the same, a point that hopefully 
allows economists to read other literature that uses the GEE nomenclature. 

The section on quantile estimation covers different asymptotic variance estimators 
and discusses how they compare in terms of robustness to violations of assumptions.
New material on estimation and inference when quantile regression is applied to
panel data gives researchers simple methods for allowing unobserved effects in 
quantile estimation, while at the same time offering inference that is fully robust to 
arbitrary serial correlation. 

Chapter 13, on maximum likelihood methods, also includes several additions, 
including a general discussion of nonlinear unobserved effects models and the differ- 
ent approaches to accounting for the heterogeneity (broadly, random effects, “fixed” 
effects, and correlated random effects) and different estimation methods (partial 
maximum likelihood or full maximum likelihood). Two-step maximum likelihood 
estimators are covered in more detail, including the case where estimating parameters 
in a first stage can be more efficient than simply plugging in known population values 
in the second stage. Section 13.11 includes new material on quasi-maximum likeli- 
hood estimation (QMLE). This section argues that, for general misspecification, only 
one form of asymptotic variance can be used. The QMLE perspective is attractive in 
that it admits that models are almost certainly wrong; thus, we should conduct inference
on the approximation in a valid way. Vuong’s (1988) model selection tests for
nonnested models are explicitly treated as a way to choose among competing models
that are allowed to be misspecified. I show how to extend Vuong’s approach to panel 
data applications (as usual, with a relatively small number of time periods). 

Chapter 13 also includes a discussion of QMLE in the linear exponential family 
(LEF) of likelihoods, when the conditional mean is the object of interest. A general 
treatment allows me to appeal to the consistency results, and the methods for infer- 
ence, at several points in Part IV. I emphasize the link between QMLE in the LEF 
and the so-called “generalized linear models” (GLM) framework. It turns out that 
GLM is just a special case of QMLE in the LEF, and this recognition should be 
helpful for studying research conducted from the GLM perspective. A related topic is 
the GEE approach to estimating panel data models. The starting point for GEE in 
panel data is to use (for a generic time period) a likelihood in the LEF, but then to use
a generalized least squares approach to regain some of the efficiency lost by not
implementing full maximum likelihood.

Chapter 14, on generalized method of moments (GMM) and minimum distance 
(MD) estimation, has been slightly reorganized so that the panel data applications 
come at the end. These applications have also been expanded to include unobserved
effects models with time-varying loads on the heterogeneity. 

Perhaps for most readers the changes to Part IV will be most noticeable. The ma- 
terial on discrete response models has been split into two chapters (in contrast to the 
rather unwieldy single chapter in the first edition). Because Chapter 15 is the first 
applications-oriented chapter for nonlinear models, I spend more time discussing 
different ways of measuring the magnitudes of the effects on the response probability. 
The two leading choices, the partial effects evaluated at the averages and the average 
partial effect, are discussed in some detail. This discussion carries over for panel data 
models, too. A new subsection on unobserved effects panel data models with un- 
observed heterogeneity and a continuous endogenous explanatory variable shows 
how one can handle both problems in nonlinear models. This chapter contains many 
more empirical examples than the first edition. 

Chapter 16 is new, and covers multinomial and ordered responses. These models 
are now treated in more detail than in the first edition. In particular, specification 
issues are fleshed out and the issues of endogeneity and unobserved heterogeneity (in 
panel data) are now covered in some detail. 

Chapter 17, which was essentially Chapter 16 in the first edition, has been given a 
new title, Corner Solution Responses, to reflect its focus. In reading Tobin’s (1958)
paper, I was struck by how he really was talking about the corner solution case— 
data censoring had nothing to do with his analysis. Thus, this chapter returns to the 
roots of the Tobit model, and covers several extensions. An important addition is a 
more extensive treatment of two-part models, which is now in Section 17.6. Hope- 
fully, my unified approach in this section will help clarify the relationships among so- 
called “hurdle” and “selection” models, and show that the latter are not necessarily 
superior. Like Chapter 15, this chapter contains several more empirical applications. 

Chapter 18 covers other kinds of limited dependent variables, particularly count 
(nonnegative integer) outcomes and fractional responses. Recent work on panel data 
methods for fractional responses has been incorporated into this chapter. 

Chapter 19 is an amalgamation of material from several chapters in the first edi- 
tion. The theme of Chapter 19 is data problems. The problem of data censoring— 
where a random sample of units is obtained from the population, but the response 
variable is censored in some way—is given a more in-depth treatment. The extreme 
case of binary censoring is included, along with interval censoring and top coding. 
Readers are shown how to allow for endogenous explanatory variables and un- 
observed heterogeneity in panel data. 

Chapter 19 also includes the problem of not sampling at all from part of the pop- 
ulation (truncated sampling) or not having any information about a response for a 
subset of the population (incidental truncation). The material on unbalanced panel
data sets and the problems of incidental truncation and attrition in panel data are 
studied in more detail, including the method of inverse probability weighting for 
correcting for missing data. 

Chapter 20 continues the material on nonrandom sampling, providing a separate 
chapter for stratified sampling and cluster sampling. Stratification and clustering are 
often features of survey data sets, and it is important to know what adjustments are 
required to standard econometric methods. The material on cluster sampling sum- 
marizes recent work on clustering with a small number of clusters. 

The material on treatment effect estimation is now in Chapter 21. While I pre- 
served the setup from the first edition, I have added several more topics. First, I have 
expanded the discussion of matching estimators. Second, regression discontinuity designs are
covered in a separate section. 

The final chapter, Chapter 22, now includes the introductory material on duration 
analysis. I have included more empirical examples than were in the first edition. 


Possible Course Outlines 


At Michigan State, I teach a two-semester course to second-year, and some third- 
year, students that covers the material in my book—plus some additional material. I 
assume that the graduate students know, or will study on their own, material from 
Chapters 2 and 3. It helps move my courses along when students are comfortable 
with the basic algebra of probability (conditional expectations, conditional variances, 
and linear projections) as well as the basic limit theorems and manipulations. I typi- 
cally spend a few lectures on Chapters 4, 5, and 6, primarily to provide a bridge be- 
tween a more traditional treatment of the linear model and one that focuses on a 
linear population model under random sampling. Chapter 6 introduces control func- 
tion methods in a simple context and so is worth spending some time on. 

In the first semester (15 weeks), I cover the material (selectively) through Chapter 
17. But I currently skip, in the first semester, the material in Chapter 12 on multi- 
variate nonlinear regression and quantile estimation. Plus, I do not cover the 
asymptotic theory underlying M-estimation in much detail, and I pretty much skip 
Chapter 14 altogether. In effect, the first semester covers the popular linear and 
nonlinear models, for both cross section and panel data, in the context of random 
sampling, providing much of the background needed to justify the large-sample 
approximations. 

In the second semester I return to Chapter 12 and cover quantile estimation. I also 
cover the general quasi-MLE and generalized estimating equations material in 
Chapter 13. In Chapter 14, I find the minimum distance approach to estimation is
important as a more advanced estimation method. I cover some of the panel data 
examples from this chapter. I then jump to Chapter 18, which covers count and 
fractional responses. I spend a fair amount of time on Chapters 19 and 20 because 
data problems are especially important in practice, and it is important to understand 
the strengths and weaknesses of the competing methods. After I cover the main parts of
Chapter 21 (including regression discontinuity designs) and Chapter 22 (duration 
analysis), I sometimes have extra time. (However, if I were to cover some of the more 
advanced topics in Chapter 21—multivalued and multiple treatments, and dynamic 
treatment effects in the context of panel data—I likely would run out of time.) If I
do have extra time, I like to provide an introduction to nonparametric and semi- 
parametric methods. Cameron and Trivedi (2005) is accessible for the basic methods, 
while the book by Li and Racine (2007) is comprehensive. Illustrating nonparametric 
methods using the treatment effects material in Chapter 21 seems particularly 
effective. 


Supplements 


A student Solutions Manual is available that includes answers to the odd-numbered 
problems (see http://mitpress.mit.edu/9780262731836). Any instructor who adopts 
the book for a course may have access to all solutions. In addition, I have created a 
set of slides for the two-semester course that I teach. They are available as Scientific 
Word 5.5 files—which can be edited—or as pdf files. For these teaching aids see the 
web page for the second edition: http://mitpress.mit.edu/9780262232586. 
I INTRODUCTION AND BACKGROUND


In Part I we introduce the basic approach to econometrics taken throughout the book 
and cover some background material that is important to master before reading the 
remainder of the text. Students who have a solid understanding of the algebra of 
conditional expectations, conditional variances, and linear projections could skip 
Chapter 2, referring to it only as needed. Chapter 3 contains a summary of the 
asymptotic analysis needed to read Part II and beyond. In Part III we introduce additional
asymptotic tools that are needed to study nonlinear estimation.


1 Introduction


1.1 Causal Relationships and Ceteris Paribus Analysis 


The goal of most empirical studies in economics and other social sciences is to de- 
termine whether a change in one variable, say w, causes a change in another variable, 
say y. For example, does having another year of education cause an increase in 
monthly salary? Does reducing class size cause an improvement in student per- 
formance? Does lowering the business property tax rate cause an increase in city 
economic activity? Because economic variables are properly interpreted as random 
variables, we should use ideas from probability to formalize the sense in which a 
change in w causes a change in y. 

The notion of ceteris paribus—that is, holding all other (relevant) factors fixed—is 
at the crux of establishing a causal relationship. Simply finding that two variables 
are correlated is rarely enough to conclude that a change in one variable causes a 
change in another. After all, rarely can we run a controlled experiment that allows a 
simple correlation analysis to uncover causality. Instead, we can use econometric 
methods to effectively hold other factors fixed. 

If we focus on the average, or expected, response, a ceteris paribus analysis entails 
estimating E(y | w, c), the expected value of y conditional on w and c. The vector c—
whose dimension is not important for this discussion—denotes a set of control vari- 
ables that we would like to explicitly hold fixed when studying the effect of w on the 
expected value of y. The reason we control for these variables is that we think w is 
correlated with other factors that also influence y. If w is continuous, interest centers 
on ∂E(y | w, c)/∂w, which is usually called the partial effect of w on E(y | w, c). If w is
discrete, we are interested in E(y|w,c) evaluated at different values of w, with the 
elements of c fixed at the same specified values. Or, we might average across the dis- 
tribution of c. 
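
The three objects just described can be written compactly. The display below is only a notational sketch; the labels w_0 and w_1 for two values of a discrete w, and the operator E_c for averaging across the distribution of c, are introduced here for illustration and are not notation used so far in the text.

```latex
% Continuous w: the partial effect holds c fixed.
\frac{\partial E(y \mid w, c)}{\partial w}

% Discrete w: compare two values w_0 and w_1 (hypothetical labels) at the same c.
E(y \mid w = w_1, c) - E(y \mid w = w_0, c)

% Averaging the discrete comparison across the distribution of c.
E_{c}\bigl[ E(y \mid w = w_1, c) - E(y \mid w = w_0, c) \bigr]
```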

Deciding on the list of proper controls is not always straightforward, and using 
different controls can lead to different conclusions about a causal relationship be- 
tween y and w. This is where establishing causality gets tricky: it is up to us to decide 
which factors need to be held fixed. If we settle on a list of controls, and if all ele- 
ments of c can be observed, then estimating the partial effect of w on E(y|w,c) is 
relatively straightforward. Unfortunately, in economics and other social sciences, 
many elements of c are not observed. For example, in estimating the causal effect of 
education on wage, we might focus on E(wage | educ, exper, abil) where educ is years 
of schooling, exper is years of workforce experience, and abil is innate ability. In this 
case, c = (exper, abil), where exper is observed but abil is not. (It is widely agreed
among labor economists that experience and ability are two factors we should hold 
fixed to obtain the causal effect of education on wages. Other factors, such as years 
with the current employer, might belong as well. We can all agree that something
such as the last digit of one’s social security number need not be included as a con- 
trol, as it has nothing to do with wage or education.) 
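
A small simulation can make the cost of an unobserved control concrete. The sketch below is purely illustrative: the data are simulated, the coefficient values (0.08 on education, 0.10 on ability) and the function names are assumptions chosen for the example, and experience is dropped to keep the arithmetic simple. It compares the education coefficient when ability is left in the error term with the coefficient obtained after ability is partialled out of both log wage and education.

```python
import random

def ols_slope(y, x):
    """Slope from a simple regression of y on x (with an intercept)."""
    n = len(y)
    xbar = sum(x) / n
    ybar = sum(y) / n
    sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
    sxx = sum((xi - xbar) ** 2 for xi in x)
    return sxy / sxx

def residuals(y, x):
    """Residuals from a simple regression of y on x (with an intercept)."""
    n = len(y)
    b = ols_slope(y, x)
    a = sum(y) / n - b * sum(x) / n
    return [yi - a - b * xi for xi, yi in zip(x, y)]

def simulate(n=20000, seed=0):
    rng = random.Random(seed)
    beta_educ, beta_abil = 0.08, 0.10  # hypothetical population values
    abil = [rng.gauss(0, 1) for _ in range(n)]
    # Education is positively related to ability, so omitting ability matters.
    educ = [12 + 1.5 * a + rng.gauss(0, 2) for a in abil]
    logwage = [1.0 + beta_educ * e + beta_abil * a + rng.gauss(0, 0.3)
               for e, a in zip(educ, abil)]
    # "Short" regression: ability is unobserved and stays in the error term.
    short = ols_slope(logwage, educ)
    # "Long" regression: partial ability out of both variables first.
    long_ = ols_slope(residuals(logwage, abil), residuals(educ, abil))
    return short, long_

if __name__ == "__main__":
    short, long_ = simulate()
    print(f"ability omitted:    {short:.3f}")
    print(f"ability controlled: {long_:.3f}")
```

Under these assumed values the coefficient with ability omitted settles near 0.08 + 0.10 × 1.5/6.25 ≈ 0.104 rather than 0.08, because education picks up part of the effect of ability.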

As a second example, consider establishing a causal relationship between student 
attendance and performance on a final exam in a principles of economics class. We 
might be interested in E(score | attend, SAT, priGPA), where score is the final exam 
score, attend is the attendance rate, SAT is score on the scholastic aptitude test, and 
priGPA is grade point average at the beginning of the term. We can reasonably col- 
lect data on all of these variables for a large group of students. Is this setup enough 
to decide whether attendance has a causal effect on performance? Maybe not. While 
SAT and priGPA are general measures reflecting student ability and study habits, 
they do not necessarily measure one’s interest in or aptitude for economics. Such
attributes, which are difficult to quantify, may nevertheless belong in the list of con- 
trols if we are going to be able to infer that attendance rate has a causal effect on 
performance. 

In addition to not being able to obtain data on all desired controls, other problems 
can interfere with estimating causal relationships. For example, even if we have good 
measures of the elements of c, we might not have very good measures of y or w. A 
more subtle problem—which we study in detail in Chapter 9—is that we may only 
observe equilibrium values of y and w when these variables are simultaneously de- 
termined. An example is determining the causal effect of conviction rates (w) on city 
crime rates (y). 

A first course in econometrics teaches students how to apply multiple regression 
analysis to estimate ceteris paribus effects of explanatory variables on a response 
variable. In the rest of this book, we will study how to estimate such effects in a 
variety of situations. Unlike most introductory treatments, we rely heavily on con- 
ditional expectations. In Chapter 2 we provide a detailed summary of properties of 
conditional expectations. 


1.2 Stochastic Setting and Asymptotic Analysis 


1.2.1 Data Structures 


In order to give proper treatment to modern cross section and panel data methods, 
we must choose a stochastic setting that is appropriate for the kinds of cross section 
and panel data sets collected for most econometric applications. Naturally, all else 
equal, it is best if the setting is as simple as possible. It should allow us to focus on 
interpreting assumptions with economic content while not having to worry too much
about technical regularity conditions. (Regularity conditions are assumptions in- 
volving things such as the number of absolute moments of a random variable that 
must be finite.) 

For much of this book we adopt a random sampling assumption. More precisely, 
we assume that (1) a population model has been specified and (2) an independent, 
identically distributed (i.i.d.) sample can be drawn from the population. Specifying a 
population model—which may be a model of E(y|w,c), as in Section 1.1—requires 
us first to clearly define the population of interest. Defining the relevant population 
may seem to be an obvious requirement. Nevertheless, as we will see in later chapters, 
it can be subtle in some cases. 

An important virtue of the random sampling assumption is that it allows us to 
separate the sampling assumption from the assumptions made on the population 
model. In addition to putting the proper emphasis on assumptions that impinge on 
economic behavior, stating all assumptions in terms of the population is actually 
much easier than the traditional approach of stating assumptions in terms of full data 
matrices. 

Because we will rely heavily on random sampling, it is important to know what it 
allows and what it rules out. Random sampling is often reasonable for cross section 
data, where, at a given point in time, units are selected at random from the popula- 
tion. In this setup, any explanatory variables are treated as random outcomes, along 
with data on response variables. Fixed regressors cannot be identically distributed 
across observations, and so the random sampling assumption technically excludes the 
classical linear model. This feature is actually desirable for our purposes. In Section 
1.4 we provide a brief discussion of why it is important to treat explanatory variables 
as random for modern econometric analysis. 

We should not confuse the random sampling assumption with so-called experi- 
mental data. Experimental data fall under the fixed explanatory variables paradigm. 
With experimental data, researchers set values of the explanatory variables and then 
observe values of the response variable. Unfortunately, true experiments are quite 
rare in economics, and in any case nothing practically important is lost by treating 
explanatory variables that are set ahead of time as being random. It is safe to say that 
no one ever went astray by assuming random sampling in place of independent 
sampling with fixed explanatory variables. 

Random sampling does exclude cases of some interest for cross section analysis. 
For example, the identical distribution assumption is unlikely to hold for a pooled 
cross section, where random samples are obtained from the population at different 
points in time. This case is covered by independent, not identically distributed (i.n.i.d.)
observations. Allowing for non-identically distributed observations under indepen- 
dent sampling is not difficult, and its practical effects are easy to deal with. We will 
mention this case at several points in the book after the analysis is done under random
sampling. We do not cover the i.n.i.d. case explicitly in derivations because little is to 
be gained from the additional complication. 

A situation that does require special consideration occurs when cross section ob- 
servations are not independent of one another. An example is spatial correlation 
models. This situation arises when dealing with large geographical units that cannot 
be assumed to be independent draws from a large population, such as the 50 states in 
the United States. It is reasonable to expect that the unemployment rate in one state 
is correlated with the unemployment rate in neighboring states. While standard esti- 
mation methods—such as ordinary least squares and two-stage least squares—can 
usually be applied in these cases, the asymptotic theory needs to be altered. Key sta- 
tistics often (although not always) need to be modified. We will briefly discuss some 
of the issues that arise in this case for single-equation linear models, but otherwise 
this subject is beyond the scope of this book. For better or worse, spatial correlation 
is often ignored in applied work because correcting the problem can be difficult. 

Cluster sampling also induces correlation in a cross section data set, but in many 
cases it is relatively easy to deal with econometrically. For example, retirement saving
of employees within a firm may be correlated because of common (often unobserved) 
characteristics of workers within a firm or because of features of the firm itself (such 
as type of retirement plan). Each firm represents a group or cluster, and we may 
sample several workers from a large number of firms. As we will see in Chapter 20,
provided the number of clusters is large relative to the cluster sizes, standard methods 
can correct for the presence of within-cluster correlation. 
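
To see why within-cluster correlation matters, consider the simplest possible statistic, a sample mean. The sketch below is an illustration under stated assumptions and not the treatment given later in the book: the data are simulated, the firm effect and the worker-level noise both have unit variance, the function name is hypothetical, and no finite-sample adjustment is applied. It compares the usual variance estimate for a mean, which treats all observations as independent, with one that first sums deviations within each firm.

```python
import random

def cluster_mean_variances(n_clusters=500, cluster_size=10, seed=0):
    """Return (naive, clustered) variance estimates for the sample mean."""
    rng = random.Random(seed)
    data = []
    for _ in range(n_clusters):
        firm_effect = rng.gauss(0, 1)  # common to all workers in the firm
        data.append([firm_effect + rng.gauss(0, 1) for _ in range(cluster_size)])
    n = n_clusters * cluster_size
    ybar = sum(sum(g) for g in data) / n
    # Naive estimate: treats all n observations as independent draws.
    s2 = sum((y - ybar) ** 2 for g in data for y in g) / (n - 1)
    naive = s2 / n
    # Cluster-robust estimate: sum deviations within each firm, then square.
    clustered = sum(sum(y - ybar for y in g) ** 2 for g in data) / n ** 2
    return naive, clustered

if __name__ == "__main__":
    naive, clustered = cluster_mean_variances()
    print(f"naive:     {naive:.5f}")
    print(f"clustered: {clustered:.5f}")
```

With 500 firms of 10 workers each, the variance of the mean implied by this design is (1 + 1/10)/500 = 0.0022, while the naive formula targets 2/5000 = 0.0004, so ignoring the firm effect understates the sampling variance by a factor of about five.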

Another important issue is that cross section samples often are, either intentionally 
or unintentionally, chosen so that they are not random samples from the population 
of interest. In Chapters 19 and 20 we discuss such problems at length, including sample
selection and stratified sampling. As we will see, even in cases of nonrandom samples, 
the assumptions on the population model play a central role. 

For panel data (or longitudinal data), which consist of repeated observations on the 
same cross section of, say, individuals, households, firms, or cities over time, the 
random sampling assumption initially appears much too restrictive. After all, any 
reasonable stochastic setting should allow for correlation in individual or firm be- 
havior over time. But the random sampling assumption, properly stated, does allow 
for temporal correlation. What we will do is assume random sampling in the cross 
section dimension. The dependence in the time series dimension can be entirely 
unrestricted. As we will see, this approach is justified in panel data applications with 
many cross section observations spanning a relatively short time period. We will also be 
able to cover panel data sample selection and stratification issues within this paradigm. 

A panel data setup that we will not adequately cover—although the estimation 
methods we cover can usually be used—is seen when the cross section dimension and 
time series dimension are roughly of the same magnitude, such as when the sample 
consists of countries over the post-World War II period. In this case it makes little 
sense to fix the time series dimension and let the cross section dimension grow. The 
research on asymptotic analysis with these kinds of panel data sets is still in its early 
stages, and it requires special limit theory. See, for example, Quah (1994), Pesaran 
and Smith (1995), Kao (1999), Moon and Phillips (2000), Phillips and Moon (2000), 
and Alvarez and Arellano (2003). 


1.2.2 Asymptotic Analysis 


Throughout this book we focus on asymptotic properties, as opposed to finite sample 
properties, of estimators. The primary reason for this emphasis is that finite sample 
properties are intractable for most of the estimators we study in this book. In fact, 
most of the estimators we cover will not have desirable finite sample properties such 
as unbiasedness. Asymptotic analysis allows for a unified treatment of estimation 
procedures, and it (along with the random sampling assumption) allows us to state all 
assumptions in terms of the underlying population. Naturally, asymptotic analysis is 
not without its drawbacks. Occasionally, we will mention when asymptotics can lead 
one astray. In those cases where finite sample properties can be derived, you are 
sometimes asked to derive such properties in the problems. 

In cross section analysis the asymptotics is as the number of observations, denoted 
N throughout this book, tends to infinity. Usually what is meant by this statement is 
obvious. For panel data analysis, the asymptotics is as the cross section dimension 
gets large while the time series dimension is fixed. 
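
As a rough illustration of asymptotics with the time dimension fixed, the following sketch simulates a panel with T = 3 and strong dependence over time within each unit, then checks how the error of a pooled sample mean shrinks as N grows. The data-generating process, the sample sizes, and the function names are assumptions made purely for illustration.

```python
import random
import statistics

T = 3  # time series dimension held fixed

def draw_unit(rng):
    """One cross section unit: T outcomes that are strongly correlated over time."""
    persistent = rng.gauss(0.0, 1.0)          # unit-specific, constant over t
    return [1.0 + persistent + 0.5 * rng.gauss(0.0, 1.0) for _ in range(T)]

def pooled_mean(n_units, rng):
    """Average of all N*T outcomes; the population mean is 1.0 by construction."""
    total = 0.0
    for _ in range(n_units):
        total += sum(draw_unit(rng))
    return total / (n_units * T)

def rmse(n_units, reps, rng):
    """Root mean squared error of the pooled mean over repeated samples."""
    errors = [(pooled_mean(n_units, rng) - 1.0) ** 2 for _ in range(reps)]
    return statistics.fmean(errors) ** 0.5

rng = random.Random(1)
rmse_small = rmse(n_units=50, reps=200, rng=rng)
rmse_large = rmse(n_units=800, reps=200, rng=rng)
print(f"RMSE with N=50: {rmse_small:.4f}; with N=800: {rmse_large:.4f}")
```

Because units are drawn independently, increasing N by a factor of 16 should cut the root mean squared error by roughly a factor of 4, even though the dependence across the three time periods is left unrestricted.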


1.3 Some Examples 

In this section we provide two examples to emphasize some of the concepts from the 
previous sections. We begin with a standard example from labor economics. 


Example 1.1 (Wage Offer Function): Suppose that the natural log of the wage offer, 
$wage^o$, is determined as 


$$\log(wage^o) = \beta_0 + \beta_1 educ + \beta_2 exper + \beta_3 married + u, \tag{1.1}$$

where educ is years of schooling, exper is years of labor market experience, and 
married is a binary variable indicating marital status. The variable u, called the error 
term or disturbance, contains unobserved factors that affect the wage offer. Interest 
lies in the unknown parameters, the $\beta_j$. 

We should have a concrete population in mind when specifying equation (1.1). For 
example, equation (1.1) could be for the population of all working women. In this 
case, it will not be difficult to obtain a random sample from the population. 

All assumptions can be stated in terms of the population model. The crucial 
assumptions involve the relationship between u and the observable explanatory vari- 
ables, educ, exper, and married. For example, is the expected value of u given the 
explanatory variables educ, exper, and married equal to zero? Is the variance of u 
conditional on the explanatory variables constant? There are reasons to think the 
answer to both of these questions is no, something we discuss at some length in 
Chapters 4 and 5. The point of raising them here is to emphasize that all such ques- 
tions are most easily couched in terms of the population model. 

What happens if the relevant population is all women over age 18? A problem 
arises because a random sample from this population will include women for whom 
the wage offer cannot be observed because they are not working. Nevertheless, we 
can think of a random sample being obtained, but then $wage^o$ is unobserved for 
women not working. 

For deriving the properties of estimators, it is often useful to write the population 
model for a generic draw from the population. Equation (1.1) becomes 


$$\log(wage_i^o) = \beta_0 + \beta_1 educ_i + \beta_2 exper_i + \beta_3 married_i + u_i, \tag{1.2}$$


where i indexes person. Stating assumptions in terms of $u_i$ and $\mathbf{x}_i = (educ_i, exper_i, married_i)$ 
is the same as stating assumptions in terms of u and x. Throughout this 
book, the i subscript is reserved for indexing cross section units, such as individual, 
firm, city, and so on. Letters such as j, g, and h will be used to index variables, 
parameters, and equations. 

Before ending this example, we note that using matrix notation to write equation 
(1.2) for all N observations adds nothing to our understanding of the model or sam- 
pling scheme; in fact, it just gets in the way because it gives the mistaken impression 
that the matrices tell us something about the assumptions in the underlying popula- 
tion. It is much better to focus on the population model (1.1). 


The next example is illustrative of panel data applications. 


Example 1.2 (Effect of Spillovers on Firm Output): Suppose that the population is 
all manufacturing firms in a country operating during a given three-year period. A 
production function describing output in the population of firms is 

$$\log(output_t) = \delta_t + \beta_1 \log(labor_t) + \beta_2 \log(capital_t) + \beta_3 spillover_t + quality + u_t, \quad t = 1, 2, 3. \tag{1.3}$$

Here, $spillover_t$ is a measure of foreign firm concentration in the region containing the 
firm. The term quality contains unobserved factors (such as unobserved managerial 
or worker quality) that affect productivity and are constant over time. The error $u_t$ 
represents unobserved shocks in each time period. The presence of the parameters $\delta_t$, 
which represent different intercepts in each year, allows for aggregate productivity 
to change over time. The coefficients on $labor_t$, $capital_t$, and $spillover_t$ are assumed 
constant across years. 

As we will see when we study panel data methods, there are several issues in 
deciding how best to estimate the $\beta_j$. An important one is whether the unobserved 
productivity factors (quality) are correlated with the observable inputs. Also, can we 
assume that $spillover_t$ at, say, t = 3 is uncorrelated with the error terms in all time 
periods? 

For panel data it is especially useful to add an i subscript indicating a generic cross 
section observation—in this case, a randomly sampled firm: 


$$\log(output_{it}) = \delta_t + \beta_1 \log(labor_{it}) + \beta_2 \log(capital_{it}) + \beta_3 spillover_{it} + quality_i + u_{it}, \quad t = 1, 2, 3. \tag{1.4}$$


Equation (1.4) makes it clear that $quality_i$ is a firm-specific term that is constant over 
time and also has the same effect in each time period, while $u_{it}$ changes across time 
and firm. Nevertheless, the key issues that we must address for estimation can be 
discussed for a generic i, since the draws are assumed to be randomly made from the 
population of all manufacturing firms. 

Equation (1.4) is an example of another convention we use throughout the book: the 
subscript t is reserved to index time, just as i is reserved for indexing the cross section. 


1.4 Why Not Fixed Explanatory Variables? 


We have seen two examples where, generally speaking, the error in an equation can 
be correlated with one or more of the explanatory variables. This possibility is 
so prevalent in social science applications that it makes little sense to adopt an 
assumption—namely, the assumption of fixed explanatory variables—that rules out 
such correlation a priori. 


In a first course in econometrics, the method of ordinary least squares (OLS) and 
its extensions are usually learned under the fixed regressor assumption. This is ap- 
propriate for understanding the mechanics of least squares and for gaining experience 
with statistical derivations. Unfortunately, reliance on fixed regressors or, more gen- 
erally, fixed “exogenous” variables can have unintended consequences, especially in 
more advanced settings. For example, in Chapters 7 through 11 we will see that as- 
suming fixed regressors or fixed instrumental variables in panel data models imposes 
often unrealistic restrictions on dynamic economic behavior. This is not just a tech- 
nical point: estimation methods that are consistent under the fixed regressor as- 
sumption, such as generalized least squares, are no longer consistent when the fixed 
regressor assumption is relaxed in interesting ways. 

To illustrate the shortcomings of the fixed regressor assumption in a familiar con- 
text, consider a linear model for cross section data, written for each observation i as 


$$y_i = \beta_0 + \mathbf{x}_i \boldsymbol{\beta} + u_i, \quad i = 1, 2, \ldots, N, \tag{1.5}$$


where $\mathbf{x}_i$ is a $1 \times K$ vector and $\boldsymbol{\beta}$ is a $K \times 1$ vector. It is common to see the “ideal” 
assumptions for this model stated as “The errors $\{u_i: i = 1, 2, \ldots, N\}$ are i.i.d. with 
$E(u_i) = 0$ and $Var(u_i) = \sigma^2$.” (Sometimes the $u_i$ are also assumed to be normally 
distributed.) The problem with this statement is that it omits the most important 
consideration: What is assumed about the relationship between $u_i$ and $\mathbf{x}_i$? If the $\mathbf{x}_i$ are 
taken as nonrandom (which, evidently, is very often the implicit assumption), then 
$u_i$ and $\mathbf{x}_i$ are independent of one another. In nonexperimental environments this 
assumption rules out too many situations of interest. Some important questions, such 
as efficiency comparisons across models with different explanatory variables, cannot 
even be asked in the context of fixed regressors. (See Problems 4.5 and 4.15 of 
Chapter 4 for specific examples.) 

In a random sampling context, the $u_i$ are always independent and identically 
distributed, regardless of how they are related to the $\mathbf{x}_i$. Assuming that the population 
mean of the error is zero is without loss of generality when an intercept is included 
in the model. Thus, the statement “The errors $\{u_i: i = 1, 2, \ldots, N\}$ are i.i.d. with 
$E(u_i) = 0$ and $Var(u_i) = \sigma^2$” is vacuous in a random sampling context. Viewing the 
$\mathbf{x}_i$ as random draws along with $y_i$ forces us to think about the relationship between 
the error and the explanatory variables in the population. For example, in the 
population model $y = \beta_0 + \mathbf{x}\boldsymbol{\beta} + u$, is the expected value of u given x equal to zero? Is u 
correlated with one or more elements of x? Is the variance of u given x constant, or 
does it depend on x? These are the questions that are relevant for estimating $\boldsymbol{\beta}$ and for 
determining how to perform statistical inference. 

Because our focus is on asymptotic analysis, we have the luxury of allowing for 
random explanatory variables throughout the book, whether the setting is linear 
models, nonlinear models, single-equation analysis, or system analysis. An incidental 
but nontrivial benefit is that, compared with frameworks that assume fixed explan- 
atory variables, the unifying theme of random sampling actually simplifies the 
asymptotic analysis. We will never state assumptions in terms of full data matrices, 
because such assumptions can be imprecise and can impose unintended restrictions 
on the population model. 
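
The point that “i.i.d. errors with mean zero” says nothing about how the error relates to the regressors can be checked numerically. The sketch below draws a random sample from a hypothetical population in which E(u) = 0 but E(u | x) = 0.8x; the coefficient values, the sample size, and the helper names are assumptions chosen only for illustration.

```python
import random
import statistics

def draw(rng):
    """One random draw (x_i, y_i, u_i) from the population y = 1 + 2x + u."""
    x = rng.gauss(0.0, 1.0)
    u = 0.8 * x + rng.gauss(0.0, 1.0)   # E(u) = 0, yet E(u | x) = 0.8x
    return x, 1.0 + 2.0 * x + u, u

def ols_slope(xs, ys):
    """Slope from a simple regression of y on x (with an intercept)."""
    xbar, ybar = statistics.fmean(xs), statistics.fmean(ys)
    sxy = sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys))
    sxx = sum((x - xbar) ** 2 for x in xs)
    return sxy / sxx

rng = random.Random(2)
draws = [draw(rng) for _ in range(20000)]
xs, ys, us = (list(col) for col in zip(*draws))
mean_u = statistics.fmean(us)
slope = ols_slope(xs, ys)
print(f"sample mean of u: {mean_u:.3f}; OLS slope: {slope:.3f} (population beta is 2)")
```

The errors here are i.i.d. with mean zero across draws, yet the OLS slope settles near 2.8 rather than 2, because what matters is the relationship between u and x in the population.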


2 Conditional Expectations and Related Concepts in Econometrics 


2.1 Role of Conditional Expectations in Econometrics 


As we suggested in Section 1.1, the conditional expectation plays a crucial role 
in modern econometric analysis. Although it is not always explicitly stated, the goal 
of most applied econometric studies is to estimate or test hypotheses about the ex- 
pectation of one variable—called the explained variable, the dependent variable, the 
regressand, or the response variable, and usually denoted y—conditional on a set of 
explanatory variables, independent variables, regressors, control variables, or covari- 
ates, usually denoted x = (x1, X2,..., XK). 

A substantial portion of research in econometric methodology can be interpreted 
as finding ways to estimate conditional expectations in the numerous settings that 
arise in economic applications. As we briefly discussed in Section 1.1, most of the 
time we are interested in conditional expectations that allow us to infer causality 
from one or more explanatory variables to the response variable. In the setup from 
Section 1.1, we are interested in the effect of a variable w on the expected value of 
y, holding fixed a vector of controls, c. The conditional expectation of interest is 
E(y|w,c), which we will call a structural conditional expectation. If we can collect 
data on y, w, and c in a random sample from the underlying population of interest, 
then it is fairly straightforward to estimate E( y | w,c)—especially if we are willing to 
make an assumption about its functional form—in which case the effect of w on 
E(y|w,c), holding c fixed, is easily estimated. 

Unfortunately, complications often arise in the collection and analysis of economic 
data because of the nonexperimental nature of economics. Observations on economic 
variables can contain measurement error, or they are sometimes properly viewed as 
the outcome of a simultaneous process. Sometimes we cannot obtain a random 
sample from the population, which may not allow us to estimate E(y|w,c). Perhaps 
the most prevalent problem is that some variables we would like to control for (ele- 
ments of c) cannot be observed. In each of these cases there is a conditional expec- 
tation (CE) of interest, but it generally involves variables for which the econometrician 
cannot collect data or requires an experiment that cannot be carried out. 

Under additional assumptions—generally called identification assumptions—we 
can sometimes recover the structural conditional expectation originally of interest, 
even if we cannot observe all of the desired controls, or if we only observe equilib- 
rium outcomes of variables. As we will see throughout this text, the details differ 
depending on the context, but the notion of conditional expectation is fundamental. 

In addition to providing a unified setting for interpreting economic models, the CE 
operator is useful as a tool for manipulating structural equations into estimable 
equations. In the next section we give an overview of the important features of the 
conditional expectations operator. The appendix to this chapter contains a more 
extensive list of properties. 


2.2 Features of Conditional Expectations 


2.2.1 Definition and Examples 


Let y be a random variable, which we refer to in this section as the explained variable, 
and let $\mathbf{x} = (x_1, x_2, \ldots, x_K)$ be a $1 \times K$ random vector of explanatory variables. If 
$E(|y|) < \infty$, then there is a function, say $\mu: \mathbb{R}^K \to \mathbb{R}$, such that 

$$E(y \mid x_1, x_2, \ldots, x_K) = \mu(x_1, x_2, \ldots, x_K), \tag{2.1}$$

or $E(y \mid \mathbf{x}) = \mu(\mathbf{x})$. The function $\mu(\mathbf{x})$ determines how the average value of y changes 
as elements of x change. For example, if y is wage and x contains various individual 
characteristics, such as education, experience, and IQ, then E(wage | educ, exper, IQ) 
is the average value of wage for the given values of educ, exper, and IQ. Technically, 
we should distinguish E(y|x) (which is a random variable because x is a random 
vector defined in the population) from the conditional expectation when x takes on 
a particular value, such as $\mathbf{x}_0$: $E(y \mid \mathbf{x} = \mathbf{x}_0)$. Making this distinction soon becomes 
cumbersome and, in most cases, is not overly important; for the most part we avoid 
it. When discussing probabilistic features of E(y|x), x is necessarily viewed as a 
random variable. 

Because E(y |x) is an expectation, it can be obtained from the conditional density 
of y given x by integration, summation, or a combination of the two (depending on 
the nature of y). It follows that the conditional expectation operator has the same 
linearity properties as the unconditional expectation operator, and several additional 
properties that are consequences of the randomness of $\mu(\mathbf{x})$. Some of the statements 
we make are proven in the appendix, but general proofs of other assertions require 
measure-theoretic probability. You are referred to Billingsley (1979) for a detailed 
treatment. 

Most often in econometrics a model for a conditional expectation is specified to 
depend on a finite set of parameters, which gives a parametric model of E(y|x). This 
considerably narrows the list of possible candidates for $\mu(\mathbf{x})$. 
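
For a discrete distribution, obtaining the conditional expectation by summation is mechanical. The following sketch computes the conditional mean function from a made-up joint distribution of a binary x and a discrete y; the probabilities, outcome values, and the function name `conditional_mean` are hypothetical and serve only to illustrate the definition.

```python
from collections import defaultdict

# Hypothetical joint probabilities P(x = a, y = b); the numbers are made up.
joint = {
    (0, 10.0): 0.20,
    (0, 20.0): 0.20,
    (1, 10.0): 0.10,
    (1, 20.0): 0.30,
    (1, 40.0): 0.20,
}

def conditional_mean(joint_probs):
    """Return mu(x) = E(y | x) as a dict, by summing y against P(y | x)."""
    p_x = defaultdict(float)      # marginal probability of each x value
    sum_y = defaultdict(float)    # sum of y * P(x, y) for each x value
    for (x, y), p in joint_probs.items():
        p_x[x] += p
        sum_y[x] += y * p
    return {x: sum_y[x] / p_x[x] for x in p_x}

mu = conditional_mean(joint)
print(mu)
```

With these probabilities the conditional mean is 15 when x = 0 and 25 when x = 1 (up to floating-point rounding), so the function of x takes only two values; a parametric model would restrict how such values can vary with x.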


Example 2.1: For K = 2 explanatory variables, consider the following examples of 
conditional expectations: 


$$E(y \mid x_1, x_2) = \beta_0 + \beta_1 x_1 + \beta_2 x_2, \tag{2.2}$$
$$E(y \mid x_1, x_2) = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_2^2, \tag{2.3}$$
$$E(y \mid x_1, x_2) = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_1 x_2, \tag{2.4}$$
$$E(y \mid x_1, x_2) = \exp[\beta_0 + \beta_1 \log(x_1) + \beta_2 x_2], \quad y \geq 0, \; x_1 > 0. \tag{2.5}$$


The model in equation (2.2) is linear in the explanatory variables $x_1$ and $x_2$. Equation 
(2.3) is an example of a conditional expectation nonlinear in $x_2$, although it is linear 
in $x_1$. As we will review shortly, from a statistical perspective, equations (2.2) and 
(2.3) can be treated in the same framework because they are linear in the parameters 
$\beta_j$. The fact that equation (2.3) is nonlinear in x has important implications for 
interpreting the $\beta_j$, but not for estimating them. Equation (2.4) falls into this same 
class: it is nonlinear in $\mathbf{x} = (x_1, x_2)$ but linear in the $\beta_j$. 

Equation (2.5) differs fundamentally from the first three examples in that it is a 
nonlinear function of the parameters $\beta_j$, as well as of the $x_j$. Nonlinearity in the 
parameters has implications for estimating the $\beta_j$; we will see how to estimate such 
models when we cover nonlinear methods in Part III. For now, you should note that 
equation (2.5) is reasonable only if $y \geq 0$. 


2.2.2 Partial Effects, Elasticities, and Semielasticities 


If y and x are related in a deterministic fashion, say $y = f(\mathbf{x})$, then we are often 
interested in how y changes when elements of x change. In a stochastic setting we 
cannot assume that $y = f(\mathbf{x})$ for some known function and observable vector x 
because there are always unobserved factors affecting y. Nevertheless, we can define the 
partial effects of the $x_j$ on the conditional expectation E(y|x). Assuming that $\mu(\cdot)$ 
is appropriately differentiable and $x_j$ is a continuous variable, the partial derivative 
$\partial \mu(\mathbf{x}) / \partial x_j$ allows us to approximate the marginal change in E(y|x) when $x_j$ is 
increased by a small amount, holding $x_1, \ldots, x_{j-1}, x_{j+1}, \ldots, x_K$ constant: 

$$\Delta E(y \mid \mathbf{x}) \approx \frac{\partial \mu(\mathbf{x})}{\partial x_j} \cdot \Delta x_j, \quad \text{holding } x_1, \ldots, x_{j-1}, x_{j+1}, \ldots, x_K \text{ fixed}. \tag{2.6}$$


The partial derivative of E(y|x) with respect to $x_j$ is usually called the partial effect 
of $x_j$ on E(y|x) (or, to be somewhat imprecise, the partial effect of $x_j$ on y). 
Interpreting the magnitudes of coefficients in parametric models usually comes from the 
approximation in equation (2.6). 

If $x_j$ is a discrete variable (such as a binary variable), partial effects are computed 
by comparing E(y|x) at different settings of $x_j$ (for example, zero and one when $x_j$ is 
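binary), holding other variables fixed. 

Example 2.1 (continued): In equation (2.2) we have 

$$\frac{\partial E(y \mid \mathbf{x})}{\partial x_1} = \beta_1, \qquad \frac{\partial E(y \mid \mathbf{x})}{\partial x_2} = \beta_2.$$

As expected, the partial effects in this model are constant. In equation (2.3), 

$$\frac{\partial E(y \mid \mathbf{x})}{\partial x_1} = \beta_1, \qquad \frac{\partial E(y \mid \mathbf{x})}{\partial x_2} = \beta_2 + 2\beta_3 x_2,$$

so that the partial effect of $x_1$ is constant but the partial effect of $x_2$ depends on the 
level of $x_2$. In equation (2.4), 

$$\frac{\partial E(y \mid \mathbf{x})}{\partial x_1} = \beta_1 + \beta_3 x_2, \qquad \frac{\partial E(y \mid \mathbf{x})}{\partial x_2} = \beta_2 + \beta_3 x_1,$$

so that the partial effect of $x_1$ depends on $x_2$, and vice versa. In equation (2.5), 

$$\frac{\partial E(y \mid \mathbf{x})}{\partial x_1} = \exp(\cdot)(\beta_1 / x_1), \qquad \frac{\partial E(y \mid \mathbf{x})}{\partial x_2} = \exp(\cdot)\beta_2, \tag{2.7}$$

where $\exp(\cdot)$ denotes the function E(y|x) in equation (2.5). In this case, the partial 
effects of $x_1$ and $x_2$ both depend on $\mathbf{x} = (x_1, x_2)$. 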
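
The partial effects in Example 2.1 can be verified numerically. The sketch below compares central finite-difference derivatives with the analytic expressions for equations (2.3) through (2.5); the parameter values, the evaluation point, and the function names are arbitrary assumptions made for illustration.

```python
import math

B0, B1, B2, B3 = 0.5, 1.0, -0.3, 0.2   # illustrative parameter values

def mu_23(x1, x2):
    return B0 + B1 * x1 + B2 * x2 + B3 * x2 ** 2          # equation (2.3)

def mu_24(x1, x2):
    return B0 + B1 * x1 + B2 * x2 + B3 * x1 * x2          # equation (2.4)

def mu_25(x1, x2):
    return math.exp(B0 + B1 * math.log(x1) + B2 * x2)     # equation (2.5)

def partial(mu, x1, x2, j, h=1e-6):
    """Central-difference approximation to the partial derivative of mu in x_j."""
    if j == 1:
        return (mu(x1 + h, x2) - mu(x1 - h, x2)) / (2 * h)
    return (mu(x1, x2 + h) - mu(x1, x2 - h)) / (2 * h)

x1, x2 = 2.0, 3.0
checks = {
    "(2.3), x2": (partial(mu_23, x1, x2, 2), B2 + 2 * B3 * x2),
    "(2.4), x1": (partial(mu_24, x1, x2, 1), B1 + B3 * x2),
    "(2.4), x2": (partial(mu_24, x1, x2, 2), B2 + B3 * x1),
    "(2.5), x1": (partial(mu_25, x1, x2, 1), mu_25(x1, x2) * B1 / x1),
    "(2.5), x2": (partial(mu_25, x1, x2, 2), mu_25(x1, x2) * B2),
}
for label, (numeric, analytic) in checks.items():
    print(f"{label}: numeric {numeric:.6f}, analytic {analytic:.6f}")
```

Changing the evaluation point shows directly which partial effects are constant and which depend on the level of the explanatory variables.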


Sometimes we are interested in a particular function of a partial effect, such as an 
elasticity. In the deterministic case $y = f(\mathbf{x})$, we define the elasticity of y with respect 
to $x_j$ as 

$$\frac{\partial y}{\partial x_j} \cdot \frac{x_j}{y} = \frac{\partial f(\mathbf{x})}{\partial x_j} \cdot \frac{x_j}{f(\mathbf{x})}, \tag{2.8}$$

again assuming that $x_j$ is continuous. The right-hand side of equation (2.8) shows 
that the elasticity is a function of x. When y and x are random, it makes sense to use 
the right-hand side of equation (2.8), but where $f(\mathbf{x})$ is the conditional mean, $\mu(\mathbf{x})$. 
Therefore, the (partial) elasticity of E(y|x) with respect to $x_j$, holding 
$x_1, \ldots, x_{j-1}, x_{j+1}, \ldots, x_K$ constant, is 

$$\frac{\partial E(y \mid \mathbf{x})}{\partial x_j} \cdot \frac{x_j}{E(y \mid \mathbf{x})} = \frac{\partial \mu(\mathbf{x})}{\partial x_j} \cdot \frac{x_j}{\mu(\mathbf{x})}. \tag{2.9}$$

If $E(y \mid \mathbf{x}) > 0$ and $x_j > 0$ (as is often the case), equation (2.9) is the same as 

$$\frac{\partial \log[E(y \mid \mathbf{x})]}{\partial \log(x_j)}. \tag{2.10}$$

This latter expression gives the elasticity its interpretation as the approximate 
percentage change in E(y|x) when $x_j$ increases by 1 percent. 


Example 2.1 (continued): In equations (2.2) to (2.5), most elasticities are not 
constant. For example, in equation (2.2), the elasticity of E(y|x) with respect to $x_1$ is 
$(\beta_1 x_1)/(\beta_0 + \beta_1 x_1 + \beta_2 x_2)$, which clearly depends on $x_1$ and $x_2$. However, in 
equation (2.5) the elasticity with respect to $x_1$ is constant and equal to $\beta_1$. 
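
A quick numerical check of these elasticity claims is sketched below. It evaluates the elasticity in equation (2.9) with respect to the first explanatory variable at a few points, for the linear model (2.2) and the exponential model (2.5). The parameter values and evaluation points are arbitrary assumptions, chosen so that the conditional mean is positive.

```python
import math

B0, B1, B2 = 0.5, 0.7, -0.3   # illustrative parameter values

def mu_22(x1, x2):
    return B0 + B1 * x1 + B2 * x2                        # equation (2.2)

def mu_25(x1, x2):
    return math.exp(B0 + B1 * math.log(x1) + B2 * x2)    # equation (2.5)

def elasticity_x1(mu, x1, x2, h=1e-6):
    """Equation (2.9) for j = 1, with the derivative taken by central difference."""
    dmu = (mu(x1 + h, x2) - mu(x1 - h, x2)) / (2 * h)
    return dmu * x1 / mu(x1, x2)

points = [(1.0, 0.5), (2.0, 1.0), (5.0, 0.2)]
elast_22 = [elasticity_x1(mu_22, a, b) for a, b in points]
elast_25 = [elasticity_x1(mu_25, a, b) for a, b in points]
print("linear model (2.2):     ", [round(e, 4) for e in elast_22])
print("exponential model (2.5):", [round(e, 4) for e in elast_25])
```

The elasticities for the linear model change from point to point, while those for the exponential model equal the coefficient on the log of the first variable everywhere.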


How does equation (2.10) compare with the definition of elasticity based on the 
expected value of log(y)? If y > 0 and $x_j > 0$, we could define the elasticity as 

$$\frac{\partial E[\log(y) \mid \mathbf{x}]}{\partial \log(x_j)}. \tag{2.11}$$

This seems to be a natural definition in a model such as $\log(y) = g(\mathbf{x}) + u$, where 
$g(\mathbf{x})$ is some function of x and u is an unobserved disturbance with zero mean 
conditional on x. How do equations (2.10) and (2.11) compare? Generally, they are 
different (since the expected value of the log and the log of the expected value can be 
very different). If u is independent of x, then equations (2.10) and (2.11) are the same, 
because then 

$$E(y \mid \mathbf{x}) = \delta \cdot \exp[g(\mathbf{x})],$$

where $\delta = E[\exp(u)]$. (If u and x are independent, so are exp(u) and exp[g(x)].) As a 
specific example, if 

$$\log(y) = \beta_0 + \beta_1 \log(x_1) + \beta_2 x_2 + u, \tag{2.12}$$

where u has zero mean and is independent of $(x_1, x_2)$, then the elasticity of y with 
respect to $x_1$ is $\beta_1$ using either definition of elasticity. If E(u|x) = 0 but u and x are 
not independent, the definitions are generally different. 
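
One case where the two definitions can be compared in closed form is sketched below. It assumes, purely for illustration, a single regressor and an error that is normally distributed given x with variance depending on x, so that the conditional mean of exp(u) is the exponential of half that variance. The variance function, parameter value, and function names are hypothetical.

```python
import math

BETA1 = 0.8   # illustrative elasticity parameter in log(y) = BETA1*log(x) + u

def var_u(x):
    """Assumed form of Var(u | x); a constant function gives homoskedasticity."""
    return 0.5 * x

def log_cond_mean(x, variance=var_u):
    """log E(y | x) when u | x is Normal(0, variance(x)): E[exp(u) | x] = exp(variance/2)."""
    return BETA1 * math.log(x) + 0.5 * variance(x)

def elasticity_210(x, variance=var_u, h=1e-6):
    """Equation (2.10): derivative of log E(y | x) with respect to log(x)."""
    up = log_cond_mean(x * math.exp(h), variance)
    down = log_cond_mean(x * math.exp(-h), variance)
    return (up - down) / (2 * h)

elasticity_211 = BETA1   # equation (2.11): E[log(y) | x] = BETA1*log(x)

hetero = elasticity_210(2.0)
homo = elasticity_210(2.0, variance=lambda x: 0.5)
print(f"(2.11): {elasticity_211:.3f}; (2.10) heteroskedastic: {hetero:.3f}; homoskedastic: {homo:.3f}")
```

Under this assumed variance function, definition (2.10) evaluated at x = 2 gives 1.3 while definition (2.11) gives 0.8; with a constant variance the two coincide.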

In many applications with y > 0, little is lost by viewing equations (2.10) and 
(2.11) as the same, and in some later applications we will not make a distinction. 
Nevertheless, if the error u in a model such as (2.12) is heteroskedastic—that is, 
Var(u|x) depends on x—then equation (2.10) and equation (2.11) can deviate in 
nontrivial ways; see Problem 2.8. Although it is partly a matter of taste, equation 
(2.10) [or, even better, equation (2.9)] is attractive because it applies to any model of 
E(y|x), and so equation (2.10) allows comparison across many different functional 
forms. In addition, equation (2.10) applies even when log(y) is not defined, 
which is important in Chapters 17 and 18, where we will need this general definition 
of an elasticity. 

The percentage change in E(y|x) when $x_j$ is increased by one unit is approximated 
as 

$$100 \cdot \frac{\partial E(y \mid \mathbf{x})}{\partial x_j} \cdot \frac{1}{E(y \mid \mathbf{x})}, \tag{2.13}$$

which equals 

$$100 \cdot \frac{\partial \log[E(y \mid \mathbf{x})]}{\partial x_j} \tag{2.14}$$

if E(y|x) > 0. This is sometimes called the semielasticity of E(y|x) with respect to $x_j$. 


Example 2.1 (continued): In equation (2.5), the semielasticity with respect to $x_2$ 
is constant and equal to $100 \cdot \beta_2$. No other semielasticities are constant in these 
equations. 


2.2.3 Error Form of Models of Conditional Expectations 


When y is a random variable we would like to explain in terms of observable vari- 
ables x, it is useful to decompose y as 


$$y = E(y \mid \mathbf{x}) + u, \tag{2.15}$$
$$E(u \mid \mathbf{x}) = 0. \tag{2.16}$$


In other words, equations (2.15) and (2.16) are definitional: we can always write y as 
its conditional expectation, E(y |x), plus an error term or disturbance term that has 
conditional mean zero. 

The fact that E(u|x) = 0 has the following important implications: (1) E(u) = 0; 
(2) u is uncorrelated with any function of $x_1, x_2, \ldots, x_K$, and, in particular, u is 
uncorrelated with each of $x_1, x_2, \ldots, x_K$. That u has zero unconditional expectation 
follows as a special case of the law of iterated expectations (LIE), which we cover 
more generally in the next subsection. Intuitively, it is quite reasonable that E(u|x) = 0 
implies E(u) = 0. The second implication is less obvious but very important. The 
fact that u is uncorrelated with any function of x is much stronger than merely saying 
that u is uncorrelated with $x_1, \ldots, x_K$. 

As an example, if equation (2.2) holds, then we can write 

$$y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + u, \qquad E(u \mid x_1, x_2) = 0, \tag{2.17}$$

and so 

$$E(u) = 0, \qquad Cov(x_1, u) = 0, \qquad Cov(x_2, u) = 0. \tag{2.18}$$

But we can say much more: under equation (2.17), u is also uncorrelated with any 
other function we might think of, such as $x_1^2$, $x_2^2$, $x_1 x_2$, $\exp(x_1)$, and $\log(x_2^2 + 1)$. This 
fact ensures that we have fully accounted for the effects of $x_1$ and $x_2$ on the expected 
value of y; another way of stating this point is that we have the functional form of 
E(y|x) properly specified. 

If we only assume equation (2.18), then u can be correlated with nonlinear 
functions of $x_1$ and $x_2$, such as quadratics, interactions, and so on. If we hope to estimate 
the partial effect of each $x_j$ on E(y|x) over a broad range of values for x, we want 
E(u|x) = 0. (In Section 2.3 we discuss the weaker assumption (2.18) and its uses.) 
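
The gap between zero correlation and a zero conditional mean can be seen in a simulated example. In the sketch below the true conditional mean is quadratic in a single regressor, but the error is defined from a linear fit, so it is uncorrelated with the regressor by construction (the sample analogue of assuming only zero covariance). The data-generating process and helper names are assumptions for illustration.

```python
import random
import statistics

def cov(a, b):
    """Sample covariance of two equal-length lists."""
    abar, bbar = statistics.fmean(a), statistics.fmean(b)
    return sum((ai - abar) * (bi - bbar) for ai, bi in zip(a, b)) / (len(a) - 1)

rng = random.Random(3)
n = 50000
x = [rng.gauss(0.0, 1.0) for _ in range(n)]
y = [xi ** 2 + rng.gauss(0.0, 1.0) for xi in x]     # true E(y | x) = x^2

# Linear projection of y on (1, x): slope Cov(x, y)/Var(x), then the intercept.
slope = cov(x, y) / cov(x, x)
intercept = statistics.fmean(y) - slope * statistics.fmean(x)
u = [yi - intercept - slope * xi for xi, yi in zip(x, y)]

x_sq = [xi ** 2 for xi in x]
cov_x_u = cov(x, u)
cov_xsq_u = cov(x_sq, u)
print(f"Cov(x, u) = {cov_x_u:.4f}; Cov(x^2, u) = {cov_xsq_u:.4f}")
```

The error from the linear fit has essentially zero covariance with x but a covariance near 2 with its square, which signals that the linear functional form does not capture the conditional mean.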


Example 2.2: Suppose that housing prices are determined by the simple model 

$$hprice = \beta_0 + \beta_1 sqrft + \beta_2 distance + u,$$

where sqrft is the square footage of the house and distance is the distance of the house 
from a city incinerator. For $\beta_2$ to represent $\partial E(hprice \mid sqrft, distance) / \partial distance$, we 
must assume that E(u | sqrft, distance) = 0. 


2.2.4 Some Properties of Conditional Expectations 


One of the most useful tools for manipulating conditional expectations is the law of 
iterated expectations, which we mentioned previously. Here we cover the most gen- 
eral statement needed in this book. Suppose that w is a random vector and y is a 
random variable. Let x be a random vector that is some function of w, say x = f(w). 
(The vector x could simply be a subset of w.) This statement implies that if we know 
the outcome of w, then we know the outcome of x. The most general statement of the 
LIE that we will need is 


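$$E(y \mid \mathbf{x}) = E[E(y \mid \mathbf{w}) \mid \mathbf{x}]. \tag{2.19}$$


In other words, if we write $\mu_1(\mathbf{w}) = E(y \mid \mathbf{w})$ and $\mu_2(\mathbf{x}) = E(y \mid \mathbf{x})$, we can obtain 
$\mu_2(\mathbf{x})$ by computing the expected value of $\mu_1(\mathbf{w})$ given x: $\mu_2(\mathbf{x}) = E[\mu_1(\mathbf{w}) \mid \mathbf{x}]$. 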

There is another result that looks similar to equation (2.19) but is much simpler to 
verify. Namely, 


$$E(y \mid \mathbf{x}) = E[E(y \mid \mathbf{x}) \mid \mathbf{w}]. \tag{2.20}$$


Note how the positions of x and w have been switched on the right-hand side of 
equation (2.20) compared with equation (2.19). The result in equation (2.20) follows 
easily from the conditional aspect of the expectation: since x is a function of w, 
knowing w implies knowing x; given that $\mu_2(\mathbf{x}) = E(y \mid \mathbf{x})$ is a function of x, the expected 
value of $\mu_2(\mathbf{x})$ given w is just $\mu_2(\mathbf{x})$. 

Some find a phrase useful for remembering both equations (2.19) and (2.20): “The 
smaller information set always dominates.” Here, x represents less information than 
w, since knowing w implies knowing x, but not vice versa. We will use equations 
(2.19) and (2.20) almost routinely throughout the book. 
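
For discrete random variables, equation (2.19) can be verified by direct computation. The sketch below uses a made-up joint distribution in which w consists of two binary variables (x, z), so that x is a function of w; the probabilities, outcome values, and the helper name `cond_mean` are hypothetical.

```python
from collections import defaultdict

# Hypothetical joint probabilities P(x, z, y); w = (x, z), so x is a function of w.
joint = {
    (0, 0, 1.0): 0.10, (0, 0, 3.0): 0.10,
    (0, 1, 2.0): 0.05, (0, 1, 6.0): 0.15,
    (1, 0, 0.0): 0.20, (1, 0, 4.0): 0.10,
    (1, 1, 5.0): 0.30,
}

def cond_mean(pmf, key):
    """E(y | key(x, z)) for each value of the conditioning variable."""
    prob, total = defaultdict(float), defaultdict(float)
    for (x, z, y), p in pmf.items():
        prob[key(x, z)] += p
        total[key(x, z)] += y * p
    return {k: total[k] / prob[k] for k in prob}

mu1 = cond_mean(joint, key=lambda x, z: (x, z))   # E(y | w)
mu2 = cond_mean(joint, key=lambda x, z: x)        # E(y | x), computed directly

# Right-hand side of (2.19): average mu1(w) over the distribution of w given x.
prob_x, iterated = defaultdict(float), defaultdict(float)
for (x, z, y), p in joint.items():
    prob_x[x] += p
    iterated[x] += mu1[(x, z)] * p
lie = {x: iterated[x] / prob_x[x] for x in prob_x}
print("direct:", mu2, "iterated:", lie)
```

The direct and iterated calculations agree for each value of x, as the law of iterated expectations requires.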

For many purposes we need the following special case of the general LIE (2.19). If 
x and z are any random vectors, then 


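$$E(y \mid \mathbf{x}) = E[E(y \mid \mathbf{x}, \mathbf{z}) \mid \mathbf{x}], \tag{2.21}$$

or, defining $\mu_1(\mathbf{x}, \mathbf{z}) = E(y \mid \mathbf{x}, \mathbf{z})$ and $\mu_2(\mathbf{x}) = E(y \mid \mathbf{x})$, 

$$\mu_2(\mathbf{x}) = E[\mu_1(\mathbf{x}, \mathbf{z}) \mid \mathbf{x}]. \tag{2.22}$$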


For many econometric applications, it is useful to think of $\mu_1(\mathbf{x}, \mathbf{z}) = E(y \mid \mathbf{x}, \mathbf{z})$ as 
a structural conditional expectation, but where z is unobserved. If interest lies in 
E(y|x,z), then we want the effects of the $x_j$ holding the other elements of x and z 
fixed. If z is not observed, we cannot estimate E(y | x, z) directly. Nevertheless, since 
y and x are observed, we can generally estimate E(y|x). The question, then, is 
whether we can relate E(y|x) to the original expectation of interest. (This is a ver- 
sion of the identification problem in econometrics.) The LIE provides a convenient 
way for relating the two expectations. 

Obtaining $E[\mu_1(\mathbf{x}, \mathbf{z}) \mid \mathbf{x}]$ generally requires integrating (or summing) $\mu_1(\mathbf{x}, \mathbf{z})$ 
against the conditional density of z given x, but in many cases the form of E(y|x,z) 
is simple enough not to require explicit integration. For example, suppose we begin 
with the model 


$$E(y \mid x_1, x_2, z) = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 z, \tag{2.23}$$

but where z is unobserved. By the LIE, and the linearity of the CE operator, 

$$E(y \mid x_1, x_2) = E(\beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 z \mid x_1, x_2) = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 E(z \mid x_1, x_2). \tag{2.24}$$

Now, if we make an assumption about $E(z \mid x_1, x_2)$, for example, that it is linear in $x_1$ 
and $x_2$, 

$$E(z \mid x_1, x_2) = \delta_0 + \delta_1 x_1 + \delta_2 x_2, \tag{2.25}$$

then we can plug this into equation (2.24) and rearrange: 

$$E(y \mid x_1, x_2) = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3(\delta_0 + \delta_1 x_1 + \delta_2 x_2) = (\beta_0 + \beta_3 \delta_0) + (\beta_1 + \beta_3 \delta_1) x_1 + (\beta_2 + \beta_3 \delta_2) x_2.$$

This last expression is $E(y \mid x_1, x_2)$; given our assumptions it is necessarily linear in 
$(x_1, x_2)$. 

Now suppose equation (2.23) contains an interaction in $x_1$ and z: 

$$E(y \mid x_1, x_2, z) = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 z + \beta_4 x_1 z. \tag{2.26}$$

Then, again by the LIE, 

$$E(y \mid x_1, x_2) = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 E(z \mid x_1, x_2) + \beta_4 x_1 E(z \mid x_1, x_2).$$

If $E(z \mid x_1, x_2)$ is again given in equation (2.25), you can show that $E(y \mid x_1, x_2)$ has 
terms linear in $x_1$ and $x_2$ and, in addition, contains $x_1^2$ and $x_1 x_2$. The usefulness of 
such derivations will become apparent in later chapters. 
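
The algebra in these two derivations can be checked numerically. The sketch below plugs the linear form assumed in (2.25) into the structural means (2.23) and (2.26), and compares the result with the collapsed polynomial in the two observed variables. All coefficient values and function names are hypothetical, chosen only for illustration.

```python
B = (1.0, 0.5, -0.4, 2.0, 0.3)    # illustrative beta_0..beta_4
D = (0.2, 0.6, -0.1)              # illustrative delta_0..delta_2 in (2.25)

def mean_z(x1, x2):
    """E(z | x1, x2) under the linear assumption (2.25)."""
    return D[0] + D[1] * x1 + D[2] * x2

def mean_y_via_lie(x1, x2, interaction=False):
    """E(y | x1, x2): replace z in the structural mean by E(z | x1, x2).

    This is valid because z enters (2.23) and (2.26) linearly for given x1.
    """
    ez = mean_z(x1, x2)
    value = B[0] + B[1] * x1 + B[2] * x2 + B[3] * ez
    if interaction:
        value += B[4] * x1 * ez       # extra term from equation (2.26)
    return value

def mean_y_collapsed(x1, x2, interaction=False):
    """The same expectation written as a polynomial in (x1, x2)."""
    value = ((B[0] + B[3] * D[0]) + (B[1] + B[3] * D[1]) * x1
             + (B[2] + B[3] * D[2]) * x2)
    if interaction:
        value += B[4] * (D[0] * x1 + D[1] * x1 ** 2 + D[2] * x1 * x2)
    return value

grid = [(-1.0, 2.0), (0.0, 0.0), (1.5, -0.5), (3.0, 4.0)]
gaps = [abs(mean_y_via_lie(a, b, flag) - mean_y_collapsed(a, b, flag))
        for a, b in grid for flag in (False, True)]
print("largest discrepancy:", max(gaps))
```

In the interaction case the collapsed form contains a squared term in the first variable and a cross-product term, so the coefficients in the observable conditional mean are mixtures of the structural parameters and the parameters of the assumed mean of z.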

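The general form of the LIE has other useful implications. Suppose that for some 
(vector) function $\mathbf{f}(\mathbf{x})$ and a real-valued function $g(\cdot)$, $E(y \mid \mathbf{x}) = g[\mathbf{f}(\mathbf{x})]$. Then 


$$E[y \mid \mathbf{f}(\mathbf{x})] = E(y \mid \mathbf{x}) = g[\mathbf{f}(\mathbf{x})]. \tag{2.27}$$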


There is another way to state this relationship: If we define z = f(x), then E(y |z) = 
g(z). The vector z can have smaller or greater dimension than x. This fact is illus- 
trated with the following example. 


Example 2.3: If a wage equation is 

$$E(wage \mid educ, exper) = \beta_0 + \beta_1 educ + \beta_2 exper + \beta_3 exper^2 + \beta_4 educ \cdot exper,$$

then 

$$E(wage \mid educ, exper, exper^2, educ \cdot exper) = \beta_0 + \beta_1 educ + \beta_2 exper + \beta_3 exper^2 + \beta_4 educ \cdot exper.$$

In other words, once educ and exper have been conditioned on, it is redundant to 
condition on $exper^2$ and $educ \cdot exper$. 


The conclusion in this example is much more general, and it is helpful for 
analyzing models of conditional expectations that are linear in parameters. Assume that, for 
some functions $g_1(\mathbf{x}), g_2(\mathbf{x}), \ldots, g_M(\mathbf{x})$, 

$$E(y \mid \mathbf{x}) = \beta_0 + \beta_1 g_1(\mathbf{x}) + \beta_2 g_2(\mathbf{x}) + \cdots + \beta_M g_M(\mathbf{x}). \tag{2.28}$$

This model allows substantial flexibility, as the explanatory variables can appear in 
all kinds of nonlinear ways; the key restriction is that the model is linear in the $\beta_j$. If 
we define $z_1 = g_1(\mathbf{x}), \ldots, z_M = g_M(\mathbf{x})$, then equation (2.27) implies that 

$$E(y \mid z_1, z_2, \ldots, z_M) = \beta_0 + \beta_1 z_1 + \beta_2 z_2 + \cdots + \beta_M z_M. \tag{2.29}$$

This equation shows that any conditional expectation linear in parameters can 
be written as a conditional expectation linear in parameters and linear in some 
conditioning variables. If we write equation (2.29) in error form as 
$y = \beta_0 + \beta_1 z_1 + \beta_2 z_2 + \cdots + \beta_M z_M + u$, then, because E(u|x) = 0 and the $z_j$ are functions of x, it 
follows that u is uncorrelated with $z_1, \ldots, z_M$ (and any functions of them). As we will 
see in Chapter 4, this result allows us to cover models of the form (2.28) in the same 
framework as models linear in the original explanatory variables. 

We also need to know how the notion of statistical independence relates to condi- 
tional expectations. If u is a random variable independent of the random vector x, 
then E(u|x) = E(u), so that if E(u) = 0 and u and x are independent, then E(u |x) = 
0. The converse of this is not true: E(u|x) = E(u) does not imply statistical inde- 
pendence between u and x (just as zero correlation between u and x does not imply 
independence). 
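
To make the failure of the converse concrete, here is a minimal sketch using a small hypothetical population (the support points and probabilities are invented for illustration). Given $x$, the variable $u$ equals $+x$ or $-x$ with equal probability, so its conditional mean is zero while its conditional variance moves with $x$.

```python
from fractions import Fraction

# Hypothetical discrete population: x is 1 or 2 with equal probability,
# and, given x, u equals +x or -x with equal probability.
half = Fraction(1, 2)
joint = {(x, s * x): half * half for x in (1, 2) for s in (-1, 1)}  # P(x, u)

def cond_moment(x_val, power):
    """E(u**power | x = x_val) computed from the joint probabilities."""
    p_x = sum(p for (x, _), p in joint.items() if x == x_val)
    return sum(p * u**power for (x, u), p in joint.items() if x == x_val) / p_x

mean_u_given_x = {x: cond_moment(x, 1) for x in (1, 2)}
var_u_given_x = {x: cond_moment(x, 2) - cond_moment(x, 1) ** 2 for x in (1, 2)}

print(mean_u_given_x)  # E(u | x) is zero for both values of x
print(var_u_given_x)   # Var(u | x) changes with x, so u and x are dependent
```

Because $\mathrm{Var}(u \mid x)$ differs across values of $x$, the conditional distribution of $u$ depends on $x$, which rules out independence even though $E(u \mid x) = E(u) = 0$.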


2.2.5 Average Partial Effects 


When we explicitly allow the expectation of the response variable, y, to depend on 
unobservables—usually called unobserved heterogeneity—we must be careful in 
specifying the partial effects of interest. Suppose that we have in mind the (structural) 
conditional mean $E(y \mid x, q) = \mu_1(x, q)$, where $x$ is a vector of observable explanatory
variables and $q$ is an unobserved random variable, the unobserved heterogeneity.
(We take $q$ to be a scalar for simplicity; the discussion for a vector is essentially the
same.) For continuous $x_j$, the partial effect of immediate interest is


$$\theta_j(x, q) \equiv \partial E(y \mid x, q)/\partial x_j = \partial \mu_1(x, q)/\partial x_j. \tag{2.30}$$


(For discrete $x_j$, we would simply look at differences in the regression function for $x_j$
at two different values, when the other elements of $x$ and $q$ are held fixed.) Because
$\theta_j(x, q)$ generally depends on $q$, we cannot hope to estimate the partial effects across
many different values of $q$. In fact, even if we could estimate $\theta_j(x, q)$ for all $x$ and $q$,
we would generally have little guidance about inserting values of $q$ into the mean
function. In many cases we can make a normalization such as $E(q) = 0$, and estimate
$\theta_j(x, 0)$, but $q = 0$ typically corresponds to a very small segment of the population.
(Technically, $q = 0$ corresponds to no one in the population when $q$ is continuously
distributed.) Usually of more interest is the partial effect averaged across the population
distribution of $q$; this is called the average partial effect (APE).

For emphasis, let $x^o$ denote a fixed value of the covariates. The average partial
effect evaluated at $x^o$ is

$$\delta_j(x^o) \equiv E_q[\theta_j(x^o, q)], \tag{2.31}$$


where $E_q[\cdot]$ denotes the expectation with respect to $q$. In other words, we simply average
the partial effect $\theta_j(x^o, q)$ across the population distribution of $q$. Definition (2.31) holds
for any population relationship between $q$ and $x$; in particular, they need not be independent.
But remember, in definition (2.31), $x^o$ is a nonrandom vector of numbers.

For concreteness, assume that $q$ has a continuous distribution with density function
$g(\cdot)$, so that


$$\delta_j(x^o) = \int_{\mathbb{R}} \theta_j(x^o, \mathfrak{q}) g(\mathfrak{q}) \, d\mathfrak{q}, \tag{2.32}$$


where $\mathfrak{q}$ is simply the dummy argument in the integration. The question we answer
here is, Is it possible to estimate $\delta_j(x^o)$ from conditional expectations that depend
only on observable conditioning variables? Generally, the answer must be no, as $q$
and $x$ can be arbitrarily related. Nevertheless, if we appropriately restrict the relationship
between $q$ and $x$, we can obtain a very useful equivalence.

One common assumption in nonlinear models with unobserved heterogeneity is 
that q and x are independent. We will make the weaker assumption that q and x are 
independent conditional on a vector of observables, w: 


$$D(q \mid x, w) = D(q \mid w), \tag{2.33}$$


where $D(\cdot \mid \cdot)$ denotes conditional distribution. (If we take $w$ to be empty, we get the
special case of independence between q and x.) In many cases, we can interpret 
equation (2.33) as implying that w is a vector of good proxy variables for q, but 
equation (2.33) turns out to be fairly widely applicable. We also assume that w is 
redundant or ignorable in the structural expectation 


E(y|x,q,w) = E(y|x,q). (2.34) 


As we will see in subsequent chapters, many econometric methods hinge on being 
able to exclude certain variables from the equation of interest, and equation (2.34) 
makes this assumption precise. Of course, if w is empty, then equation (2.34) is trivi- 
ally true. 

Under equations (2.33) and (2.34), we can show the following important result, 
provided that we can interchange a certain integral and partial derivative: 


$$\delta_j(x^o) = E_w[\partial E(y \mid x^o, w)/\partial x_j], \tag{2.35}$$


where $E_w[\cdot]$ denotes the expectation with respect to the distribution of $w$. Before we
verify equation (2.35) for the special case of continuous, scalar $q$, we must understand
its usefulness. The point is that the unobserved heterogeneity, $q$, has disappeared entirely,
and the conditional expectation $E(y \mid x, w)$ can be estimated quite generally
because we assume that a random sample can be obtained on $(y, x, w)$. (Alternatively,
when we write down parametric econometric models, we will be able to derive
$E(y \mid x, w)$.) Then, estimating the average partial effect at any chosen $x^o$ amounts to
averaging $\partial \hat{\mu}_2(x^o, w_i)/\partial x_j$ across the random sample, where $\mu_2(x, w) \equiv E(y \mid x, w)$.
Proving equation (2.35) is fairly simple. First, we have 


$$\mu_2(x, w) = E[E(y \mid x, q, w) \mid x, w] = E[\mu_1(x, q) \mid x, w] = \int_{\mathbb{R}} \mu_1(x, \mathfrak{q}) g(\mathfrak{q} \mid w) \, d\mathfrak{q},$$


where the first equality follows from the law of iterated expectations, the second 
equality follows from equation (2.34), and the third equality follows from equation 
(2.33). If we now take the partial derivative with respect to $x_j$ of the equality


$$\mu_2(x, w) = \int_{\mathbb{R}} \mu_1(x, \mathfrak{q}) g(\mathfrak{q} \mid w) \, d\mathfrak{q} \tag{2.36}$$


and interchange the partial derivative and the integral, we have, for any (x, w), 


$$\partial \mu_2(x, w)/\partial x_j = \int_{\mathbb{R}} \theta_j(x, \mathfrak{q}) g(\mathfrak{q} \mid w) \, d\mathfrak{q}. \tag{2.37}$$


For fixed $x^o$, the right-hand side of equation (2.37) is simply $E[\theta_j(x^o, q) \mid w]$, and so
another application of iterated expectations gives, for any $x^o$,


$$E_w[\partial \mu_2(x^o, w)/\partial x_j] = E\{E[\theta_j(x^o, q) \mid w]\} = \delta_j(x^o),$$


which is what we wanted to show. 
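
The result in equation (2.35) can be checked numerically. The sketch below is an illustration under assumed ingredients, not part of the original argument: the structural mean $\exp(xq)$, the binary proxy $w$, and all probabilities are hypothetical choices. By construction the distribution of $q$ depends only on $w$, and $w$ does not enter the structural mean, so assumptions (2.33) and (2.34) hold.

```python
import math

# Hypothetical discrete setup (all numbers are illustrative):
# w is a binary proxy; q takes values -1, 0, 1 with a distribution that depends on w.
p_w = {0: 0.6, 1: 0.4}
p_q_given_w = {0: {-1: 0.5, 0: 0.3, 1: 0.2},
               1: {-1: 0.1, 0: 0.3, 1: 0.6}}

def mu1(x, q):
    """Structural mean E(y | x, q); an assumed nonlinear form."""
    return math.exp(x * q)

def theta(x, q):
    """Partial effect d mu1 / dx, which depends on the unobserved q."""
    return q * math.exp(x * q)

def mu2(x, w):
    """E(y | x, w), obtained by averaging q out given w (discrete analogue of (2.36))."""
    return sum(mu1(x, q) * p for q, p in p_q_given_w[w].items())

x0, h = 0.5, 1e-5

# Definition (2.31): average theta over the marginal distribution of q.
p_q = {q: sum(p_w[w] * p_q_given_w[w][q] for w in p_w) for q in (-1, 0, 1)}
ape_direct = sum(theta(x0, q) * p for q, p in p_q.items())

# Equation (2.35): average the numerical derivative of mu2 over the distribution of w.
ape_from_observables = sum(
    p_w[w] * (mu2(x0 + h, w) - mu2(x0 - h, w)) / (2 * h) for w in p_w
)

print(ape_direct, ape_from_observables)  # the two numbers agree closely
```

The second calculation never uses $q$ directly once $E(y \mid x, w)$ is available, which is the practical content of the result.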

As mentioned previously, equation (2.35) has many applications in models where 
unobserved heterogeneity enters a conditional mean function in a nonadditive fash- 
ion. We will use this result (in simplified form) in Chapter 4, and also extensively in 
Part IV. The special case where $q$ is independent of $x$ (so that we do not need the
proxy variables $w$) is very simple: the APE of $x_j$ on $E(y \mid x, q)$ is simply the partial
effect of $x_j$ on $\mu_2(x) \equiv E(y \mid x)$. In other words, if we focus on average partial effects,
there is no need to introduce heterogeneity. If we do specify a model with heteroge- 
neity independent of x, then we simply find E(y | x) by integrating E(y| x, q) over the 
distribution of q. 

Our discussion of average partial effects is closely related to Blundell and Powell's
(2003) analysis of an average structural function (ASF). Blundell and Powell essentially
define the ASF, at a given value $x^o$, to be $E_q[\mu_1(x^o, q)]$, where $\mu_1(x, q) =
E(y \mid x, q)$. In other words, the average structural function takes the conditional expectation
of interest, a "structural" conditional expectation, and averages out the
unobservable, $q$. Provided the derivative and expected value can be interchanged,
which holds under very general assumptions (see Bartle (1966)), the ASF leads us to
the same place as APEs. For further discussion, see Wooldridge (2005c). We will
apply equation (2.35) to nonlinear models in several different settings, including 
panel data models and models with endogenous explanatory variables. 


2.3 Linear Projections 


In the previous section we saw some examples of how to manipulate conditional 
expectations. While structural equations are usually stated in terms of CEs, making 
linearity assumptions about CEs involving unobservables or auxiliary variables is 
undesirable, especially if such assumptions can be easily relaxed. 

By using the notion of a linear projection we can often relax linearity assumptions 
in auxiliary conditional expectations. Typically this is done by first writing down a 
structural model in terms of a CE and then using the linear projection to obtain an 
estimable equation. As we will see in Chapters 4 and 5, this approach has many 
applications. 

Generally, let $y, x_1, \ldots, x_K$ be random variables representing some population such
that $E(y^2) < \infty$, $E(x_j^2) < \infty$, $j = 1, 2, \ldots, K$. These assumptions place no practical
restrictions on the joint distribution of $(y, x_1, x_2, \ldots, x_K)$: the vector can contain discrete
and continuous variables, as well as variables that have both characteristics. In
many cases $y$ and the $x_j$ are nonlinear functions of some underlying variables that
are initially of interest.

Define $x \equiv (x_1, \ldots, x_K)$ as a $1 \times K$ vector, and make the assumption that the
$K \times K$ variance matrix of $x$ is nonsingular (positive definite). Then the linear projection
of $y$ on $1, x_1, x_2, \ldots, x_K$ always exists and is unique:


$$L(y \mid 1, x_1, \ldots, x_K) = L(y \mid 1, x) = \beta_0 + \beta_1 x_1 + \cdots + \beta_K x_K = \beta_0 + x\beta, \tag{2.38}$$
where, by definition,
$$\beta \equiv [\mathrm{Var}(x)]^{-1} \mathrm{Cov}(x, y), \tag{2.39}$$


$$\beta_0 \equiv E(y) - E(x)\beta = E(y) - \beta_1 E(x_1) - \cdots - \beta_K E(x_K). \tag{2.40}$$


The matrix $\mathrm{Var}(x)$ is the $K \times K$ symmetric matrix with $(j, k)$th element given by
$\mathrm{Cov}(x_j, x_k)$, while $\mathrm{Cov}(x, y)$ is the $K \times 1$ vector with $j$th element $\mathrm{Cov}(x_j, y)$. When
$K = 1$ we have the familiar results $\beta_1 \equiv \mathrm{Cov}(x_1, y)/\mathrm{Var}(x_1)$ and $\beta_0 \equiv E(y) -
\beta_1 E(x_1)$. As its name suggests, $L(y \mid 1, x_1, x_2, \ldots, x_K)$ is always a linear function of
the $x_j$.

Other authors use a different notation for linear projections, the most common 
being $E^*(\cdot \mid \cdot)$ and $P(\cdot \mid \cdot)$. (For example, Chamberlain (1984) and Goldberger (1991)
use $E^*(\cdot \mid \cdot)$.) Some authors omit the 1 in the definition of a linear projection because
it is assumed that an intercept is always included. Although this is usually the case,
we put unity in explicitly to distinguish equation (2.38) from the case that a zero intercept
is intended. The linear projection of $y$ on $x_1, x_2, \ldots, x_K$ is defined as


$$L(y \mid x) = L(y \mid x_1, x_2, \ldots, x_K) = \gamma_1 x_1 + \gamma_2 x_2 + \cdots + \gamma_K x_K = x\gamma,$$


where $\gamma \equiv (E(x'x))^{-1} E(x'y)$. Note that $\gamma \ne \beta$ unless $E(x) = 0$. Later, we will include
unity as an element of x, in which case the linear projection including an intercept 
can be written as L(y|x). 

The linear projection is just another way of writing down a population linear 
model where the disturbance has certain properties. Given the linear projection in 
equation (2.38) we can always write 


$$y = \beta_0 + \beta_1 x_1 + \cdots + \beta_K x_K + u, \tag{2.41}$$


where the error term $u$ has the following properties (by definition of a linear projection):
$E(u^2) < \infty$ and


$$E(u) = 0, \quad \mathrm{Cov}(x_j, u) = 0, \quad j = 1, 2, \ldots, K. \tag{2.42}$$


In other words, $u$ has zero mean and is uncorrelated with every $x_j$. Conversely, given
equations (2.41) and (2.42), the parameters $\beta_j$ in equation (2.41) must be the parameters
in the linear projection of $y$ on $1, x_1, \ldots, x_K$ given by definitions (2.39) and
(2.40). Sometimes we will write a linear projection in error form, as in equations 
(2.41) and (2.42), but other times the notation (2.38) is more convenient. 

It is important to emphasize that when equation (2.41) represents the linear pro- 
jection, all we can say about u is contained in equation (2.42). In particular, it is not 
generally true that u is independent of x or that E(u|x) = 0. Here is another way of 
saying the same thing: equations (2.41) and (2.42) are definitional. Equation (2.41) 
under E(u |x) = 0 is an assumption that the conditional expectation is linear. 
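
The distinction can be seen in a small exact calculation. The following sketch assumes a hypothetical population in which $x$ takes the values 0, 1, 2 with equal probability and $y = x^2$, so the conditional expectation is nonlinear. It computes the projection coefficients from definitions (2.39) and (2.40), the no-intercept coefficient $\gamma$, and the projection error at each support point.

```python
from fractions import Fraction

# Hypothetical population: x is 0, 1, or 2 with equal probability and y = x**2,
# so E(y | x) = x**2 is nonlinear in x.
support = [0, 1, 2]
prob = Fraction(1, 3)

def E(f):
    """Population expectation of f(x) over the three support points."""
    return sum(prob * f(x) for x in support)

y = lambda x: x**2

# Projection with an intercept, definitions (2.39) and (2.40) for K = 1.
cov_xy = E(lambda x: x * y(x)) - E(lambda x: x) * E(y)
var_x = E(lambda x: x**2) - E(lambda x: x) ** 2
beta1 = cov_xy / var_x
beta0 = E(y) - beta1 * E(lambda x: x)

# Projection without an intercept: gamma = E(x'x)^{-1} E(x'y).
gamma = E(lambda x: x * y(x)) / E(lambda x: x**2)

# Projection error u = y - beta0 - beta1 * x at each support point.
u = {x: y(x) - beta0 - beta1 * x for x in support}
mean_u = sum(prob * u[x] for x in support)
cov_xu = sum(prob * x * u[x] for x in support)  # equals Cov(x, u) since E(u) = 0

print(beta0, beta1, gamma)  # -1/3 2 9/5
print(mean_u, cov_xu)       # 0 0
print(u)                    # E(u | x) is nonzero at every x
```

The error satisfies (2.42) exactly, yet $E(u \mid x)$ equals $1/3$, $-2/3$, and $1/3$ at the three support points, so the projection is not the conditional expectation. The slope $\gamma$ also differs from $\beta_1$ because $E(x) \ne 0$.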

The linear projection is sometimes called the minimum mean square linear predictor 
or the least squares linear predictor because $\beta_0$ and $\beta$ can be shown to solve the following
problem:


$$\min_{b_0, \, b \in \mathbb{R}^K} E[(y - b_0 - xb)^2] \tag{2.43}$$

(see Property LP.6 in the appendix). Because the CE is the minimum mean square
predictor—that is, it gives the smallest mean square error out of all (allowable) 
functions (see Property CE.8)—it follows immediately that if E(y|x) is linear in x, 
then the linear projection coincides with the conditional expectation. 

As with the conditional expectation operator, the linear projection operator sat- 
isfies some important iteration properties. For vectors x and z, 


$$L(y \mid 1, x) = L[L(y \mid 1, x, z) \mid 1, x]. \tag{2.44}$$


This simple fact can be used to derive omitted variables bias in a general setting as
well as to prove properties of estimation methods such as two-stage least squares and
certain panel data methods.

Another iteration property that is useful involves taking the linear projection of a 
conditional expectation: 


L(y|1,x) = L[E(y|x,z)|1,x]. (2.45) 
Often we specify a structural model in terms of a conditional expectation E(y |x, z) 
(which is frequently linear), but, for a variety of reasons, the estimating equations are 
based on the linear projection $L(y \mid 1, x)$. If $E(y \mid x, z)$ is linear in $x$ and $z$, then
equations (2.45) and (2.44) say the same thing. 

For example, assume that 
$$E(y \mid x_1, x_2) = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_1 x_2$$
and define $z_1 \equiv x_1 x_2$. Then, from Property CE.3,


$$E(y \mid x_1, x_2, z_1) = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 z_1. \tag{2.46}$$


The right-hand side of equation (2.46) is also the linear projection of $y$ on $1$, $x_1$, $x_2$,
and $z_1$; it is not generally the linear projection of $y$ on $1$, $x_1$, $x_2$.

Our primary use of linear projections will be to obtain estimable equations 
involving the parameters of an underlying conditional expectation of interest. Prob- 
lems 2.2 and 2.3 show how the linear projection can have an interesting interpreta- 
tion in terms of the structural parameters. 


Problems 


2.1. Given random variables $y$, $x_1$, and $x_2$, consider the model


$$E(y \mid x_1, x_2) = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_2^2 + \beta_4 x_1 x_2.$$

a. Find the partial effects of $x_1$ and $x_2$ on $E(y \mid x_1, x_2)$.


b. Writing the equation as
$$y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_2^2 + \beta_4 x_1 x_2 + u,$$
what can be said about $E(u \mid x_1, x_2)$? What about $E(u \mid x_1, x_2, x_2^2, x_1 x_2)$?


c. In the equation of part b, what can be said about $\mathrm{Var}(u \mid x_1, x_2)$?


2.2. Let $y$ and $x$ be scalars such that


$$E(y \mid x) = \delta_0 + \delta_1(x - \mu) + \delta_2(x - \mu)^2,$$


where $\mu = E(x)$.

a. Find $\partial E(y \mid x)/\partial x$, and comment on how it depends on $x$.

b. Show that $\delta_1$ is equal to $\partial E(y \mid x)/\partial x$ averaged across the distribution of $x$.

c. Suppose that $x$ has a symmetric distribution, so that $E[(x - \mu)^3] = 0$. Show that
$L(y \mid 1, x) = \alpha_0 + \delta_1 x$ for some $\alpha_0$. Therefore, the coefficient on $x$ in the linear projection
of $y$ on $(1, x)$ measures something useful in the nonlinear model for $E(y \mid x)$: it
is the partial effect $\partial E(y \mid x)/\partial x$ averaged across the distribution of $x$.


2.3. Suppose that
$$E(y \mid x_1, x_2) = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_1 x_2. \tag{2.47}$$
a. Write this expectation in error form (call the error $u$), and describe the properties
of $u$.


b. Suppose that $x_1$ and $x_2$ have zero means. Show that $\beta_1$ is the expected value of
$\partial E(y \mid x_1, x_2)/\partial x_1$ (where the expectation is across the population distribution of $x_2$).
Provide a similar interpretation for $\beta_2$.


c. Now add the assumption that $x_1$ and $x_2$ are independent of one another. Show
that the linear projection of $y$ on $(1, x_1, x_2)$ is
$$L(y \mid 1, x_1, x_2) = \beta_0 + \beta_1 x_1 + \beta_2 x_2. \tag{2.48}$$


(Hint: Show that, under the assumptions on $x_1$ and $x_2$, $x_1 x_2$ has zero mean and is
uncorrelated with $x_1$ and $x_2$.)


d. Why is equation (2.47) generally more useful than equation (2.48)? 


2.4. For random scalars u and v and a random vector x, suppose that E(u | x, v) is a 
linear function of (x, v) and that u and v each have zero mean and are uncorrelated 
with the elements of $x$. Show that $E(u \mid x, v) = E(u \mid v) = \rho_1 v$ for some $\rho_1$.


2.5. Consider the two representations
$$y = \mu_1(x, z) + u_1, \quad E(u_1 \mid x, z) = 0,$$
$$y = \mu_2(x) + u_2, \quad E(u_2 \mid x) = 0.$$


Assuming that $\mathrm{Var}(y \mid x, z)$ and $\mathrm{Var}(y \mid x)$ are both constant, what can you say about
the relationship between $\mathrm{Var}(u_1)$ and $\mathrm{Var}(u_2)$? (Hint: Use Property CV.4 in the
appendix.) 


2.6. Let $x$ be a $1 \times K$ random vector, and let $q$ be a random scalar. Suppose that
$q$ can be expressed as $q = q^* + e$, where $E(e) = 0$ and $E(x'e) = 0$. Write the linear
projection of $q^*$ onto $(1, x)$ as $q^* = \delta_0 + \delta_1 x_1 + \cdots + \delta_K x_K + r^*$, where $E(r^*) = 0$ and
$E(x'r^*) = 0$.

a. Show that

$$L(q \mid 1, x) = \delta_0 + \delta_1 x_1 + \cdots + \delta_K x_K.$$


b. Find the projection error $r \equiv q - L(q \mid 1, x)$ in terms of $r^*$ and $e$.


2.7. Consider the conditional expectation 


$$E(y \mid x, z) = g(x) + z\beta,$$


where $g(\cdot)$ is a general function of $x$ and $\beta$ is a $1 \times M$ vector. Typically, this is called
a partial linear model. Show that


$$E(\tilde{y} \mid \tilde{z}) = \tilde{z}\beta,$$


where $\tilde{y} \equiv y - E(y \mid x)$ and $\tilde{z} \equiv z - E(z \mid x)$. Robinson (1988) shows how to use this
result to estimate $\beta$ without specifying $g(\cdot)$.


2.8. Suppose y is a nonnegative continuous variable generated as 


$$\log(y) = g(x) + u,$$
where $E(u \mid x) = 0$. Define $a(x) = E[\exp(u) \mid x]$.

a. Show that, using definition (2.9), the elasticity of $E(y \mid x)$ with respect to $x_j$ is


$$\frac{\partial g(x)}{\partial x_j} x_j + \frac{\partial a(x)}{\partial x_j} \cdot \frac{x_j}{a(x)}.$$


Note that the second term is the elasticity of $a(x)$ with respect to $x_j$.


b. If $x_j > 0$, show that the first part of the expression in part a is $\partial g(x)/\partial \log(x_j)$.

c. If you apply equation (2.11) to this model, what would you conclude is the "elasticity
of $y$ with respect to $x$"? How does this compare with part b?


2.9. Let $x$ be a $1 \times K$ vector with $x_1 \equiv 1$ (to simplify the notation), and define
$\mu(x) \equiv E(y \mid x)$. Let $\delta$ be the $K \times 1$ vector of linear projection coefficients of $y$ on $x$,
so that $\delta = [E(x'x)]^{-1} E(x'y)$. Show that $\delta$ is also the vector of coefficients in the linear
projection of $\mu(x)$ on $x$.


2.10. This problem is useful for decomposing the difference in average responses 
across two groups. It is often used, with specific functional form assumptions, to de- 
compose average wages or earnings across two groups; see, for example, Oaxaca and 
Ransom (1994). Let y be the response variable, x a set of explanatory variables, and s 
a binary group indicator. For example, s could denote gender, union membership 
status, college graduate versus non-college graduate, and so on. Define $\mu_0(x) =
E(y \mid x, s = 0)$ and $\mu_1(x) = E(y \mid x, s = 1)$ to be the regression functions for the two
groups.


a. Show that
$$E(y \mid s = 1) - E(y \mid s = 0) = \{E[\mu_1(x) \mid s = 1] - E[\mu_0(x) \mid s = 1]\} + \{E[\mu_0(x) \mid s = 1] - E[\mu_0(x) \mid s = 0]\}.$$


(Hint: First write $E(y \mid x, s) = (1 - s) \cdot \mu_0(x) + s \cdot \mu_1(x)$ and use iterated expectations.
Then use simple algebra.)


b. Suppose both expectations are linear: $\mu_s(x) = x\beta_s$, $s = 0, 1$. Show that


$$E(y \mid s = 1) - E(y \mid s = 0) = E(x \mid s = 1) \cdot (\beta_1 - \beta_0) + [E(x \mid s = 1) - E(x \mid s = 0)] \cdot \beta_0.$$


Can you interpret this decomposition? 


Appendix 2A 


2.A.1 Properties of Conditional Expectations 


PROPERTY CE.1: Let $a_1(x), \ldots, a_G(x)$ and $b(x)$ be scalar functions of $x$, and let
$y_1, \ldots, y_G$ be random scalars. Then


$$E\left(\sum_{j=1}^{G} a_j(x) y_j + b(x) \,\Big|\, x\right) = \sum_{j=1}^{G} a_j(x) E(y_j \mid x) + b(x)$$


provided that $E(|y_j|) < \infty$, $E[|a_j(x) y_j|] < \infty$, and $E[|b(x)|] < \infty$. This is the sense in
which the conditional expectation is a linear operator. 


PROPERTY CE.2: $E(y) = E[E(y \mid x)] = E[\mu(x)]$.


Property CE.2 is the simplest version of the law of iterated expectations. As an
illustration, suppose that $x$ is a discrete random vector taking on values $c_1, c_2, \ldots, c_M$
with probabilities $p_1, p_2, \ldots, p_M$. Then the LIE says


$$E(y) = p_1 E(y \mid x = c_1) + p_2 E(y \mid x = c_2) + \cdots + p_M E(y \mid x = c_M). \tag{2.49}$$


In other words, $E(y)$ is simply a weighted average of the $E(y \mid x = c_j)$, where the
weight $p_j$ is the probability that $x$ takes on the value $c_j$.


PROPERTY CE.3: (1) E(y|x) = E[E(y|w) |x], where x and w are vectors with x = 
$f(w)$ for some nonstochastic function $f(\cdot)$. (This is the general version of the law of
iterated expectations.) 

(2) As a special case of part 1, E(y |x) = E[E(y|x,z) |x] for vectors x and z. 


PROPERTY CE.4: If $f(x) \in \mathbb{R}^J$ is a function of $x$ such that $E(y \mid x) = g[f(x)]$ for some
scalar function $g(\cdot)$, then $E[y \mid f(x)] = E(y \mid x)$.


PROPERTY CE.5: If the vector (u, v) is independent of the vector x, then E(u | x, v) = 
E(u|v). 


PROPERTY CE.6: If $u \equiv y - E(y \mid x)$, then $E[g(x)u] = 0$ for any function $g(x)$, provided
that $E[|g_j(x)u|] < \infty$, $j = 1, \ldots, J$, and $E(|u|) < \infty$. In particular, $E(u) = 0$ and
$\mathrm{Cov}(x_j, u) = 0$, $j = 1, \ldots, K$.


Proof: First, note that 
$$E(u \mid x) = E[(y - E(y \mid x)) \mid x] = E[(y - \mu(x)) \mid x] = E(y \mid x) - \mu(x) = 0.$$


Next, by property CE.2, $E[g(x)u] = E(E[g(x)u \mid x]) = E[g(x)E(u \mid x)]$ (by property
CE.1) $= 0$ because $E(u \mid x) = 0$.


PROPERTY CE.7 (Conditional Jensen's Inequality): If $c: \mathbb{R} \to \mathbb{R}$ is a convex function
defined on $\mathbb{R}$ and $E[|y|] < \infty$, then


$$c[E(y \mid x)] \le E[c(y) \mid x].$$


Technically, we should add the statement "almost surely-$P_x$," which means that the
inequality holds for all $x$ in a set that has probability equal to one. As a special
case, $[E(y)]^2 \le E(y^2)$. Also, if $y > 0$, then $-\log[E(y)] \le E[-\log(y)]$, or $E[\log(y)] \le
\log[E(y)]$.

PROPERTY CE.8: If $E(y^2) < \infty$ and $\mu(x) \equiv E(y \mid x)$, then $\mu$ is a solution to

$$\min_{m \in \mathcal{M}} E[(y - m(x))^2],$$

where $\mathcal{M}$ is the set of functions $m: \mathbb{R}^K \to \mathbb{R}$ such that $E[m(x)^2] < \infty$. In other words,
$\mu(x)$ is the best mean square predictor of $y$ based on information contained in $x$.


Proof: By the conditional Jensen's inequality, it follows that $E(y^2) < \infty$ implies
$E[\mu(x)^2] < \infty$, so that $\mu \in \mathcal{M}$. Next, for any $m \in \mathcal{M}$, write


$$E[(y - m(x))^2] = E[\{(y - \mu(x)) + (\mu(x) - m(x))\}^2]$$
$$= E[(y - \mu(x))^2] + E[(\mu(x) - m(x))^2] + 2E[(\mu(x) - m(x))u],$$
where $u \equiv y - \mu(x)$. Thus, by CE.6,
$$E[(y - m(x))^2] = E(u^2) + E[(\mu(x) - m(x))^2].$$
The right-hand side is clearly minimized at $m \equiv \mu$.


2.A.2 Properties of Conditional Variances and Covariances


The conditional variance of $y$ given $x$ is defined as
$$\mathrm{Var}(y \mid x) \equiv \sigma^2(x) \equiv E[\{y - E(y \mid x)\}^2 \mid x] = E(y^2 \mid x) - [E(y \mid x)]^2.$$


The last representation is often useful for computing $\mathrm{Var}(y \mid x)$. As with the conditional
expectation, $\sigma^2(x)$ is a random variable when $x$ is viewed as a random
vector.


PROPERTY CV.1: $\mathrm{Var}[a(x)y + b(x) \mid x] = [a(x)]^2 \mathrm{Var}(y \mid x)$.

PROPERTY CV.2: $\mathrm{Var}(y) = E[\mathrm{Var}(y \mid x)] + \mathrm{Var}[E(y \mid x)] = E[\sigma^2(x)] + \mathrm{Var}[\mu(x)]$.

Proof:
$$\mathrm{Var}(y) \equiv E[(y - E(y))^2] = E[(y - E(y \mid x) + E(y \mid x) - E(y))^2]$$
$$= E[(y - E(y \mid x))^2] + E[(E(y \mid x) - E(y))^2] + 2E[(y - E(y \mid x))(E(y \mid x) - E(y))].$$
By CE.6, $E[(y - E(y \mid x))(E(y \mid x) - E(y))] = 0$; so
$$\mathrm{Var}(y) = E[(y - E(y \mid x))^2] + E[(E(y \mid x) - E(y))^2]$$
$$= E\{E[(y - E(y \mid x))^2 \mid x]\} + E[(E(y \mid x) - E[E(y \mid x)])^2]$$
by the law of iterated expectations
$$\equiv E[\mathrm{Var}(y \mid x)] + \mathrm{Var}[E(y \mid x)].$$
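
Property CV.2 can be verified exactly on a small example. The sketch below uses an invented joint distribution for $(x, y)$ (the support and probabilities are hypothetical) and exact rational arithmetic, so the two sides of the decomposition can be compared without rounding error.

```python
from fractions import Fraction

# Hypothetical joint distribution P(x, y) on a small grid (illustrative numbers).
joint = {
    (0, 1): Fraction(1, 8), (0, 3): Fraction(3, 8),
    (1, 2): Fraction(1, 4), (1, 6): Fraction(1, 4),
}

def mean(pairs):
    """Expectation for a list of (value, probability) pairs."""
    return sum(v * p for v, p in pairs)

# Unconditional mean and variance of y.
ey = mean([(y, p) for (_, y), p in joint.items()])
var_y = mean([((y - ey) ** 2, p) for (_, y), p in joint.items()])

# Conditional mean mu(x) and conditional variance sigma^2(x) for each x.
p_x, mu, sigma2 = {}, {}, {}
for x0 in {x for x, _ in joint}:
    p_x[x0] = sum(p for (x, _), p in joint.items() if x == x0)
    cond = [(y, p / p_x[x0]) for (x, y), p in joint.items() if x == x0]
    mu[x0] = mean(cond)
    sigma2[x0] = mean([((y - mu[x0]) ** 2, q) for y, q in cond])

e_cond_var = sum(p_x[x] * sigma2[x] for x in p_x)             # E[Var(y | x)]
var_cond_mean = sum(p_x[x] * (mu[x] - ey) ** 2 for x in p_x)  # Var[E(y | x)]

print(var_y, e_cond_var, var_cond_mean)  # 47/16 19/8 9/16
```

Here $E[\mathrm{Var}(y \mid x)] = 19/8$ and $\mathrm{Var}[E(y \mid x)] = 9/16$ sum to $\mathrm{Var}(y) = 47/16$. The same objects also illustrate Property CE.2, since the probability-weighted average of $\mu(x)$ equals $E(y)$.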
An extension of Property CV.2 is often useful, and its proof is similar:


PROPERTY CV.3: $\mathrm{Var}(y \mid x) = E[\mathrm{Var}(y \mid x, z) \mid x] + \mathrm{Var}[E(y \mid x, z) \mid x]$.

Consequently, by the law of iterated expectations CE.2,

PROPERTY CV.4: $E[\mathrm{Var}(y \mid x)] \ge E[\mathrm{Var}(y \mid x, z)]$.


For any function $m(\cdot)$ define the mean squared error as $\mathrm{MSE}(y; m) \equiv E[(y - m(x))^2]$.
Then CV.4 can be loosely stated as $\mathrm{MSE}[y; E(y \mid x)] \ge \mathrm{MSE}[y; E(y \mid x, z)]$. In other
words, in the population one never does worse for predicting $y$ when additional variables
are conditioned on. In particular, if $\mathrm{Var}(y \mid x)$ and $\mathrm{Var}(y \mid x, z)$ are both constant,
then $\mathrm{Var}(y \mid x) \ge \mathrm{Var}(y \mid x, z)$.

There are also important results relating conditional covariances. The most useful 
contains Property CV.3 as a special case (when yı = y2): 


PROPERTY CCOV.1: $\mathrm{Cov}(y_1, y_2 \mid x) = E[\mathrm{Cov}(y_1, y_2 \mid x, z) \mid x] + \mathrm{Cov}[E(y_1 \mid x, z),
E(y_2 \mid x, z) \mid x]$.


Proof: By definition, $\mathrm{Cov}(y_1, y_2 \mid x, z) = E\{[y_1 - E(y_1 \mid x, z)][y_2 - E(y_2 \mid x, z)] \mid
x, z\}$. Write $\mu_1(x) \equiv E(y_1 \mid x)$, $\nu_1(x, z) \equiv E(y_1 \mid x, z)$, and similarly for $y_2$. Then simple
algebra gives


$$\mathrm{Cov}(y_1, y_2 \mid x, z)$$
$$= E\{[y_1 - \mu_1(x) + \mu_1(x) - \nu_1(x, z)][y_2 - \mu_2(x) + \mu_2(x) - \nu_2(x, z)] \mid x, z\}$$
$$= E\{[y_1 - \mu_1(x)][y_2 - \mu_2(x)] \mid x, z\} + [\mu_1(x) - \nu_1(x, z)][\mu_2(x) - \nu_2(x, z)]$$
$$\quad + E\{[y_1 - \mu_1(x)][\mu_2(x) - \nu_2(x, z)] \mid x, z\} + E\{[y_2 - \mu_2(x)][\mu_1(x) - \nu_1(x, z)] \mid x, z\}$$
$$= E\{[y_1 - \mu_1(x)][y_2 - \mu_2(x)] \mid x, z\} + [\mu_1(x) - \nu_1(x, z)][\mu_2(x) - \nu_2(x, z)]$$
$$\quad + [\nu_1(x, z) - \mu_1(x)][\mu_2(x) - \nu_2(x, z)] + [\nu_2(x, z) - \mu_2(x)][\mu_1(x) - \nu_1(x, z)]$$
$$= E\{[y_1 - \mu_1(x)][y_2 - \mu_2(x)] \mid x, z\} - [\nu_1(x, z) - \mu_1(x)][\nu_2(x, z) - \mu_2(x)],$$
because the second and third terms cancel. Therefore, by iterated expectations,
$$E[\mathrm{Cov}(y_1, y_2 \mid x, z) \mid x] = E\{[y_1 - \mu_1(x)][y_2 - \mu_2(x)] \mid x\} - E\{[\nu_1(x, z) - \mu_1(x)][\nu_2(x, z) - \mu_2(x)] \mid x\}$$
or
$$E[\mathrm{Cov}(y_1, y_2 \mid x, z) \mid x] = \mathrm{Cov}(y_1, y_2 \mid x) - \mathrm{Cov}[E(y_1 \mid x, z), E(y_2 \mid x, z) \mid x]$$


because $\mu_1(x) = E[E(y_1 \mid x, z) \mid x]$, and similarly for $\mu_2(x)$. Simple rearrangement
completes the proof.


2.A.3 Properties of Linear Projections


In what follows, y is a scalar, x is a 1 x K vector, and z is a 1 x J vector. We allow 
the first element of x to be unity, although the following properties hold in either 
case. All of the variables are assumed to have finite second moments, and the ap- 
propriate variance matrices are assumed to be nonsingular. 


PROPERTY LP.1: If $E(y \mid x) = x\beta$, then $L(y \mid x) = x\beta$. More generally, if
$$E(y \mid x) = \beta_1 g_1(x) + \beta_2 g_2(x) + \cdots + \beta_M g_M(x),$$

then

$$L(y \mid w_1, \ldots, w_M) = \beta_1 w_1 + \beta_2 w_2 + \cdots + \beta_M w_M,$$


where $w_j \equiv g_j(x)$, $j = 1, 2, \ldots, M$. This property tells us that, if $E(y \mid x)$ is known to
be linear in some functions $g_j(x)$, then this linear function also represents a linear
projection.


PROPERTY LP.2: Define $u \equiv y - L(y \mid x) = y - x\beta$. Then $E(x'u) = 0$.


PROPERTY LP.3: Suppose $y_j$, $j = 1, 2, \ldots, G$ are each random scalars, and $a_1, \ldots, a_G$
are constants. Then


$$L\left(\sum_{j=1}^{G} a_j y_j \,\Big|\, x\right) = \sum_{j=1}^{G} a_j L(y_j \mid x).$$

Thus, the linear projection is a linear operator.


PROPERTY LP.4 (Law of Iterated Projections): $L(y \mid x) = L[L(y \mid x, z) \mid x]$. More
precisely, let


$$L(y \mid x, z) \equiv x\beta + z\gamma \quad \text{and} \quad L(y \mid x) = x\delta.$$

For each element of $z$, write $L(z_j \mid x) = x\pi_j$, $j = 1, \ldots, J$, where $\pi_j$ is $K \times 1$. Then
$L(z \mid x) = x\Pi$, where $\Pi$ is the $K \times J$ matrix $\Pi \equiv (\pi_1, \pi_2, \ldots, \pi_J)$. Property LP.4
implies that


$$L(y \mid x) = L(x\beta + z\gamma \mid x) = L(x \mid x)\beta + L(z \mid x)\gamma \quad \text{(by LP.3)}$$
$$= x\beta + (x\Pi)\gamma = x(\beta + \Pi\gamma). \tag{2.50}$$


Thus, we have shown that $\delta = \beta + \Pi\gamma$. This is, in fact, the population analogue of the
omitted variables bias formula from standard regression theory, something we will 
use in Chapter 4. 
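
The relationship $\delta = \beta + \Pi\gamma$ also holds as an exact algebraic identity when projections are replaced by least squares fits on a sample. The sketch below assumes NumPy is available and uses simulated data whose coefficients are arbitrary illustrative values; `ols` is a small helper defined here, not a library function.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500

# Simulated data (illustrative): x includes a constant, z is correlated with x.
x = np.column_stack([np.ones(n), rng.normal(size=n)])
z = 0.5 * x[:, [1]] + rng.normal(size=(n, 1))
y = x @ np.array([1.0, 2.0]) + z[:, 0] * 3.0 + rng.normal(size=n)

def ols(regressors, target):
    """Least squares coefficients, the sample analogue of a linear projection."""
    return np.linalg.lstsq(regressors, target, rcond=None)[0]

coef_long = ols(np.column_stack([x, z]), y)  # (beta, gamma) from y on (x, z)
beta, gamma = coef_long[:2], coef_long[2:]
Pi = ols(x, z)                               # K x J matrix from z on x
delta = ols(x, y)                            # short projection of y on x

print(delta)
print(beta + Pi @ gamma)  # matches delta up to floating-point error
```

The gap between `delta` and `beta` is `Pi @ gamma`, which is the omitted variables term: it vanishes only if the omitted variables have zero coefficients or are uncorrelated with the included ones.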


Another iteration property involves the linear projection and the conditional 
expectation: 


PROPERTY LP.5:  L(y|x) = L[E(y|x,z) | x]. 


Proof: Write $y = \mu(x, z) + u$, where $\mu(x, z) = E(y \mid x, z)$. But $E(u \mid x, z) = 0$;
so $E(x'u) = 0$, which implies by LP.3 that $L(y \mid x) = L[\mu(x, z) \mid x] + L(u \mid x) =
L[\mu(x, z) \mid x] = L[E(y \mid x, z) \mid x]$.


A useful special case of Property LP.5 occurs when z is empty. Then L(y|x) = 
L[E(y| x) | x]. 

PROPERTY LP.6: $\beta$ is a solution to


$$\min_{b \in \mathbb{R}^K} E[(y - xb)^2]. \tag{2.51}$$


If $E(x'x)$ is positive definite, then $\beta$ is the unique solution to this problem.

Proof: For any $b$, write $y - xb = (y - x\beta) + (x\beta - xb)$. Then
$$(y - xb)^2 = (y - x\beta)^2 + (x\beta - xb)^2 + 2(x\beta - xb)(y - x\beta)$$
$$= (y - x\beta)^2 + (\beta - b)'x'x(\beta - b) + 2(\beta - b)'x'(y - x\beta).$$
Therefore,
$$E[(y - xb)^2] = E[(y - x\beta)^2] + (\beta - b)'E(x'x)(\beta - b) + 2(\beta - b)'E[x'(y - x\beta)]$$
$$= E[(y - x\beta)^2] + (\beta - b)'E(x'x)(\beta - b), \tag{2.52}$$

because $E[x'(y - x\beta)] = 0$ by LP.2. When $b = \beta$, the right-hand side of equation
(2.52) is minimized. Further, if $E(x'x)$ is positive definite, $(\beta - b)'E(x'x)(\beta - b) > 0$
if $b \ne \beta$; so in this case $\beta$ is the unique minimizer.

Property LP.6 states that the linear projection is the minimum mean square linear
predictor. It is not necessarily the minimum mean square predictor: if $E(y \mid x) = \mu(x)$
is not linear in $x$, then


$$E[(y - \mu(x))^2] < E[(y - x\beta)^2]. \tag{2.53}$$


PROPERTY LP.7: This is a partitioned projection formula, which is useful in a variety 
of circumstances. Write 


$$L(y \mid x, z) = x\beta + z\gamma. \tag{2.54}$$


Define the $1 \times K$ vector of population residuals from the projection of $x$ on $z$ as
$r \equiv x - L(x \mid z)$. Further, define the population residual from the projection of $y$ on $z$
as $v \equiv y - L(y \mid z)$. Then the following are true:


$$L(v \mid r) = r\beta \tag{2.55}$$
and
$$L(y \mid r) = r\beta. \tag{2.56}$$


The point is that the $\beta$ in equations (2.55) and (2.56) is the same as that appearing in
equation (2.54). Another way of stating this result is


$$\beta = [E(r'r)]^{-1} E(r'v) = [E(r'r)]^{-1} E(r'y). \tag{2.57}$$

Proof: From equation (2.54) write

$$y = x\beta + z\gamma + u, \quad E(x'u) = 0, \quad E(z'u) = 0. \tag{2.58}$$
Taking the linear projection on $z$ gives

$$L(y \mid z) = L(x \mid z)\beta + z\gamma. \tag{2.59}$$
Subtracting equation (2.59) from (2.58) gives $y - L(y \mid z) = [x - L(x \mid z)]\beta + u$, or
$$v = r\beta + u. \tag{2.60}$$


Since $r$ is a linear combination of $(x, z)$, $E(r'u) = 0$. Multiplying equation (2.60)
through by $r'$ and taking expectations, it follows that


$$\beta = [E(r'r)]^{-1} E(r'v).$$


(We assume that $E(r'r)$ is nonsingular.) Finally, $E(r'v) = E[r'(y - L(y \mid z))] =
E(r'y)$, since $L(y \mid z)$ is linear in $z$ and $r$ is orthogonal to any linear function of $z$.
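
The partitioned projection formula has an exact sample counterpart (often called partialling out), which gives a quick way to check equations (2.55) to (2.57). The sketch below assumes NumPy and uses simulated data with arbitrary illustrative coefficients; `ols` is a helper defined here for the example.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 400

# Simulated data (illustrative): z holds a constant and one control, x has two columns.
z = np.column_stack([np.ones(n), rng.normal(size=n)])
x = z[:, [1]] * np.array([[0.8, -0.3]]) + rng.normal(size=(n, 2))
y = x @ np.array([1.5, -2.0]) + z @ np.array([0.5, 1.0]) + rng.normal(size=n)

def ols(regressors, target):
    """Least squares coefficients, the sample analogue of a linear projection."""
    return np.linalg.lstsq(regressors, target, rcond=None)[0]

beta_full = ols(np.column_stack([x, z]), y)[:2]  # coefficients on x, as in (2.54)

r = x - z @ ols(z, x)  # residuals from projecting x on z
v = y - z @ ols(z, y)  # residuals from projecting y on z

beta_from_v = ols(r, v)  # sample analogue of (2.55)
beta_from_y = ols(r, y)  # sample analogue of (2.56)

print(beta_full, beta_from_v, beta_from_y)  # three matching coefficient vectors
```

Regressing either `v` or `y` on the residuals `r` reproduces the coefficients on `x` from the full regression, because `r` is orthogonal to every column of `z` by construction.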


3 Basic Asymptotic Theory 


This chapter summarizes some definitions and limit theorems that are important for 
studying large-sample theory. Most claims are stated without proof, as several re- 
quire tedious epsilon-delta arguments. We do prove some results that build on fun- 
damental definitions and theorems. A good, general reference for background in 
asymptotic analysis is White (2001). In Chapter 12 we introduce further asymptotic 
methods that are required for studying nonlinear models. 


3.1 Convergence of Deterministic Sequences 


Asymptotic analysis is concerned with the various kinds of convergence of sequences 
of estimators as the sample size grows. We begin with some definitions regarding 
nonstochastic sequences of numbers. When we apply these results in econometrics, N 
is the sample size, and it runs through all positive integers. You are assumed to have 
some familiarity with the notion of a limit of a sequence. 


DEFINITION 3.1: (1) A sequence of nonrandom numbers $\{a_N: N = 1, 2, \ldots\}$ converges
to $a$ (has limit $a$) if for all $\varepsilon > 0$, there exists $N_\varepsilon$ such that if $N > N_\varepsilon$, then
$|a_N - a| < \varepsilon$. We write $a_N \to a$ as $N \to \infty$.

(2) A sequence $\{a_N: N = 1, 2, \ldots\}$ is bounded if and only if there is some $b < \infty$
such that $|a_N| \le b$ for all $N = 1, 2, \ldots$. Otherwise, we say that $\{a_N\}$ is unbounded.


These definitions apply to vectors and matrices element by element. 


Example 3.1: (1) If $a_N = 2 + 1/N$, then $a_N \to 2$. (2) If $a_N = (-1)^N$, then $a_N$ does
not have a limit, but it is bounded. (3) If $a_N = N^{1/4}$, $a_N$ is not bounded. Because $a_N$
increases without bound, we write $a_N \to \infty$.


DEFINITION 3.2: (1) A sequence $\{a_N\}$ is $O(N^\lambda)$ (at most of order $N^\lambda$) if $N^{-\lambda} a_N$
is bounded. When $\lambda = 0$, $\{a_N\}$ is bounded, and we also write $a_N = O(1)$ (big oh
one).

(2) $\{a_N\}$ is $o(N^\lambda)$ if $N^{-\lambda} a_N \to 0$. When $\lambda = 0$, $a_N$ converges to zero, and we also
write $a_N = o(1)$ (little oh one).


From the definitions, it is clear that if $a_N = o(N^\lambda)$, then $a_N = O(N^\lambda)$; in particular,
if $a_N = o(1)$, then $a_N = O(1)$. If each element of a sequence of vectors or matrices
is $O(N^\lambda)$, we say the sequence of vectors or matrices is $O(N^\lambda)$, and similarly for
$o(N^\lambda)$.


Example 3.2: (1) If $a_N = \log(N)$, then $a_N = o(N^\lambda)$ for any $\lambda > 0$. (2) If $a_N =
10 + \sqrt{N}$, then $a_N = O(N^{1/2})$ and $a_N = o(N^{(1/2 + \gamma)})$ for any $\gamma > 0$.
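
The orders in Example 3.2 can be seen by tabulating the scaled sequences for a few values of $N$. This is a numerical illustration rather than a proof, and the exponents `lam = 0.1` and `g = 0.25` are arbitrary choices made for the sketch.

```python
import math

# Example 3.2 (1): log(N) is o(N**lam) for any lam > 0; lam = 0.1 is an arbitrary choice.
lam = 0.1
ratios_log = [math.log(N) / N**lam for N in (10**6, 10**12, 10**24, 10**48, 10**96)]

# Example 3.2 (2): 10 + sqrt(N) is O(N**0.5), so N**-0.5 * a_N stays bounded ...
ratios_sqrt = [(10 + math.sqrt(N)) / math.sqrt(N) for N in (1, 100, 10**4, 10**8)]
# ... and it is o(N**(0.5 + g)) for any g > 0; g = 0.25 is an arbitrary choice.
g = 0.25
ratios_little_o = [(10 + math.sqrt(N)) / N ** (0.5 + g) for N in (1, 100, 10**4, 10**8)]

print(ratios_log)       # shrinks toward zero, though slowly for small lam
print(ratios_sqrt)      # bounded: the largest value, 11, occurs at N = 1
print(ratios_little_o)  # shrinks toward zero
```

With a small exponent such as 0.1, very large $N$ is needed before $\log(N)/N^\lambda$ becomes small, which is a reminder that order statements describe limiting behavior only.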


3.2 Convergence in Probability and Boundedness in Probability


DEFINITION 3.3: (1) A sequence of random variables $\{x_N: N = 1, 2, \ldots\}$ converges in
probability to the constant $a$ if for all $\varepsilon > 0$,


$$P[|x_N - a| > \varepsilon] \to 0 \quad \text{as } N \to \infty.$$


We write $x_N \xrightarrow{p} a$ and say that $a$ is the probability limit (plim) of $x_N$: $\mathrm{plim}\, x_N = a$.

(2) In the special case where $a = 0$, we also say that $\{x_N\}$ is $o_p(1)$ (little oh p one).
We also write $x_N = o_p(1)$ or $x_N \xrightarrow{p} 0$.

(3) A sequence of random variables $\{x_N\}$ is bounded in probability if and only if
for every $\varepsilon > 0$, there exists a $b_\varepsilon < \infty$ and an integer $N_\varepsilon$ such that


$$P[|x_N| \ge b_\varepsilon] < \varepsilon \quad \text{for all } N \ge N_\varepsilon.$$

We write $x_N = O_p(1)$ ($\{x_N\}$ is big oh p one).


If $c_N$ is a nonrandom sequence, then $c_N = O_p(1)$ if and only if $c_N = O(1)$; $c_N = o_p(1)$
if and only if $c_N = o(1)$. A simple and very useful fact is that if a sequence converges
in probability to any real number, then it is bounded in probability. 
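
Part 1 of Definition 3.3 can be illustrated by simulation. The sketch below takes $x_N$ to be the sample mean of $N$ independent Uniform(0, 1) draws, whose probability limit is 0.5 by a law of large numbers, and estimates $P[|x_N - 0.5| > \varepsilon]$ by Monte Carlo. The tolerance `eps`, the replication count, and the seed are arbitrary choices for the illustration.

```python
import random

random.seed(0)

def exceed_prob(n, eps=0.05, reps=2000):
    """Monte Carlo estimate of P(|xbar_N - 0.5| > eps) for Uniform(0, 1) draws."""
    hits = 0
    for _ in range(reps):
        xbar = sum(random.random() for _ in range(n)) / n
        hits += abs(xbar - 0.5) > eps
    return hits / reps

# Estimated exceedance probabilities for increasing sample sizes.
probs = {n: exceed_prob(n) for n in (10, 100, 1000)}
print(probs)  # estimated probabilities fall toward zero as N grows
```

The estimates are subject to simulation noise, but the pattern is the one the definition describes: for a fixed $\varepsilon$, the probability of a deviation larger than $\varepsilon$ shrinks as $N$ increases.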


LEMMA 3.1: If xy a, then xy = O,(1). This lemma also holds for vectors and 
matrices. 


The proof of Lemma 3.1 is not difficult; see Problem 3.1. 
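To make Definition 3.3 concrete, here is a small sketch using an example that is not from the text: let $x_N$ be the maximum of $N$ independent Uniform(0, 1) draws. Then $P[|x_N - 1| > \varepsilon] = (1 - \varepsilon)^N$ for $0 < \varepsilon < 1$, which goes to zero, so $x_N \xrightarrow{p} 1$. The function names, the seed, and the replication count are assumptions made for illustration.

```python
import random


def exact_tail_prob(n: int, eps: float) -> float:
    """Exact P(|x_N - 1| > eps) for x_N = max of n i.i.d. Uniform(0, 1) draws."""
    return (1.0 - eps) ** n


def simulated_tail_prob(n: int, eps: float, reps: int, seed: int = 0) -> float:
    """Monte Carlo estimate of the same probability (the seed is arbitrary)."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(reps):
        x_n = max(rng.random() for _ in range(n))
        if abs(x_n - 1.0) > eps:
            hits += 1
    return hits / reps


if __name__ == "__main__":
    # The tail probability should fall toward zero as N grows.
    for n in (5, 20, 80, 320):
        print(n, exact_tail_prob(n, 0.05), simulated_tail_prob(n, 0.05, reps=2000))
```

Because the sequence converges in probability, Lemma 3.1 implies it is also $O_p(1)$; here that is immediate, since $0 \le x_N \le 1$ always.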


DEFINITION 3.4: (1) A random sequence $\{x_N: N = 1, 2, \ldots\}$ is $o_p(a_N)$, where $\{a_N\}$ is
a nonrandom, positive sequence, if $x_N/a_N = o_p(1)$. We write $x_N = o_p(a_N)$.

(2) A random sequence $\{x_N: N = 1, 2, \ldots\}$ is $O_p(a_N)$, where $\{a_N\}$ is a nonrandom,
positive sequence, if $x_N/a_N = O_p(1)$. We write $x_N = O_p(a_N)$.


We could have started by defining a sequence $\{x_N\}$ to be $o_p(N^{\delta})$ for $\delta \in \mathbb{R}$ if
$N^{-\delta} x_N \xrightarrow{p} 0$, in which case we obtain the definition of $o_p(1)$ when $\delta = 0$. This is where
the one in $o_p(1)$ comes from. A similar remark holds for $O_p(1)$.


Example 3.3: If $z$ is a random variable, then $x_N = \sqrt{N} z$ is $O_p(N^{1/2})$ and $x_N =
o_p(N^{\delta})$ for any $\delta > 1/2$.


LEMMA 3.2: If $w_N = o_p(1)$, $x_N = o_p(1)$, $y_N = O_p(1)$, and $z_N = O_p(1)$, then
(1) $w_N + x_N = o_p(1)$.
(2) $y_N + z_N = O_p(1)$.
(3) $y_N z_N = O_p(1)$.
(4) $x_N z_N = o_p(1)$.


In derivations, we will write relationships 1 to 4 as $o_p(1) + o_p(1) = o_p(1)$, $O_p(1) +
O_p(1) = O_p(1)$, $O_p(1) \cdot O_p(1) = O_p(1)$, and $o_p(1) \cdot O_p(1) = o_p(1)$, respectively.
Because an $o_p(1)$ sequence is $O_p(1)$, Lemma 3.2 also implies that $o_p(1) + O_p(1) = O_p(1)$
and $o_p(1) \cdot o_p(1) = o_p(1)$.

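All of the previous definitions apply element by element to sequences of random
vectors or matrices. For example, if $\{\mathbf{x}_N\}$ is a sequence of $K \times 1$ random
vectors, $\mathbf{x}_N \xrightarrow{p} \mathbf{a}$, where $\mathbf{a}$ is a $K \times 1$ nonrandom vector, if and only if $x_{Nj} \xrightarrow{p} a_j$,
$j = 1, \ldots, K$. This is equivalent to $\|\mathbf{x}_N - \mathbf{a}\| \xrightarrow{p} 0$, where $\|\mathbf{b}\| \equiv (\mathbf{b}'\mathbf{b})^{1/2}$ denotes the
Euclidean length of the $K \times 1$ vector $\mathbf{b}$. Also, $\mathbf{Z}_N \xrightarrow{p} \mathbf{B}$, where $\mathbf{Z}_N$ and $\mathbf{B}$ are $M \times K$,
is equivalent to $\|\mathbf{Z}_N - \mathbf{B}\| \xrightarrow{p} 0$, where $\|\mathbf{A}\| \equiv [\operatorname{tr}(\mathbf{A}'\mathbf{A})]^{1/2}$ and $\operatorname{tr}(\mathbf{C})$ denotes the trace
of the square matrix $\mathbf{C}$.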

A result that we often use for studying the large-sample properties of estimators for 
linear models is the following. It is easily proven by repeated application of Lemma 
3.2 (see Problem 3.2). 


LEMMA 3.3: Let $\{\mathbf{Z}_N: N = 1, 2, \ldots\}$ be a sequence of $J \times K$ matrices such that $\mathbf{Z}_N =
o_p(1)$, and let $\{\mathbf{x}_N\}$ be a sequence of $K \times 1$ random vectors such that $\mathbf{x}_N = O_p(1)$.
Then $\mathbf{Z}_N \mathbf{x}_N = o_p(1)$.


The next lemma is known as Slutsky’s theorem. 


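LEMMA 3.4: Let $\mathbf{g}: \mathbb{R}^K \to \mathbb{R}^J$ be a function continuous at some point $\mathbf{c} \in \mathbb{R}^K$. Let
$\{\mathbf{x}_N: N = 1, 2, \ldots\}$ be a sequence of $K \times 1$ random vectors such that $\mathbf{x}_N \xrightarrow{p} \mathbf{c}$. Then
$\mathbf{g}(\mathbf{x}_N) \xrightarrow{p} \mathbf{g}(\mathbf{c})$ as $N \to \infty$. In other words,


$\operatorname{plim} \mathbf{g}(\mathbf{x}_N) = \mathbf{g}(\operatorname{plim} \mathbf{x}_N)$    (3.1)

if $\mathbf{g}(\cdot)$ is continuous at $\operatorname{plim} \mathbf{x}_N$.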


Slutsky’s theorem is perhaps the most useful feature of the plim operator: it shows 
that the plim passes through nonlinear functions, provided they are continuous. The 
expectations operator does not have this feature, and this lack makes finite sample 
analysis difficult for many estimators. Lemma 3.4 shows that plims behave just like 
regular limits when applying a continuous function to the sequence. 
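The contrast between the plim and the expectations operator can be illustrated with a short simulation. This is a sketch under assumptions chosen here, not an example from the text: $x_N$ is the average of $N$ Uniform(1, 3) draws (so $\operatorname{plim} x_N = 2$ by the weak law of large numbers stated in Section 3.4), and $g(x) = 1/x$. The plim passes through $g$, giving $1/2$, while the expectation of $g$ applied to a single draw is $\ln(3)/2 \approx 0.549$, not $g$ of the mean.

```python
import math
import random


def inverse_of_sample_mean(n: int, seed: int = 0) -> float:
    """Return g(x_N) = 1 / x_N, where x_N is the mean of n Uniform(1, 3) draws."""
    rng = random.Random(seed)
    x_n = sum(rng.uniform(1.0, 3.0) for _ in range(n)) / n
    return 1.0 / x_n


# g(plim x_N) = 1 / 2, because the population mean of Uniform(1, 3) is 2.
G_AT_PLIM = 0.5
# E[g(w)] for one draw w ~ Uniform(1, 3) is ln(3) / 2, which differs from g(E[w]).
EXPECTED_G_SINGLE_DRAW = math.log(3.0) / 2.0


if __name__ == "__main__":
    for n in (10, 1000, 100000):
        print(n, inverse_of_sample_mean(n), G_AT_PLIM, EXPECTED_G_SINGLE_DRAW)
```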


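DEFINITION 3.5: Let $(\Omega, \mathcal{F}, P)$ be a probability space. A sequence of events $\{\Omega_N:
N = 1, 2, \ldots\} \subset \mathcal{F}$ is said to occur with probability approaching one (w.p.a.1) if and
only if $P(\Omega_N) \to 1$ as $N \to \infty$.


Definition 3.5 allows that $\Omega_N^c$, the complement of $\Omega_N$, can occur for each $N$, but its
chance of occurring goes to zero as $N \to \infty$.


COROLLARY 3.1: Let $\{\mathbf{Z}_N: N = 1, 2, \ldots\}$ be a sequence of random $K \times K$ matrices,
and let $\mathbf{A}$ be a nonrandom, invertible $K \times K$ matrix. If $\mathbf{Z}_N \xrightarrow{p} \mathbf{A}$, then

(1) $\mathbf{Z}_N^{-1}$ exists w.p.a.1;

(2) $\mathbf{Z}_N^{-1} \xrightarrow{p} \mathbf{A}^{-1}$ or $\operatorname{plim} \mathbf{Z}_N^{-1} = \mathbf{A}^{-1}$ (in an appropriate sense).


Proof: Because the determinant is a continuous function on the space of all square
matrices, $\det(\mathbf{Z}_N) \xrightarrow{p} \det(\mathbf{A})$. Because $\mathbf{A}$ is nonsingular, $\det(\mathbf{A}) \ne 0$. Therefore, it
follows that $P[\det(\mathbf{Z}_N) \ne 0] \to 1$ as $N \to \infty$. This completes the proof of part 1.

Part 2 requires a convention about how to define $\mathbf{Z}_N^{-1}$ when $\mathbf{Z}_N$ is singular. Let
$\Omega_N$ be the set of $\omega$ (outcomes) such that $\mathbf{Z}_N(\omega)$ is nonsingular for $\omega \in \Omega_N$; we just
showed that $P(\Omega_N) \to 1$ as $N \to \infty$. Define a new sequence of matrices by


$\tilde{\mathbf{Z}}_N(\omega) \equiv \mathbf{Z}_N(\omega)$ when $\omega \in \Omega_N$,    $\tilde{\mathbf{Z}}_N(\omega) \equiv \mathbf{I}_K$ when $\omega \notin \Omega_N$.


Then $P(\tilde{\mathbf{Z}}_N = \mathbf{Z}_N) = P(\Omega_N) \to 1$ as $N \to \infty$. Then, because $\mathbf{Z}_N \xrightarrow{p} \mathbf{A}$, $\tilde{\mathbf{Z}}_N \xrightarrow{p} \mathbf{A}$. The
inverse operator is continuous on the space of invertible matrices, so $\tilde{\mathbf{Z}}_N^{-1} \xrightarrow{p} \mathbf{A}^{-1}$.
This is what we mean by $\mathbf{Z}_N^{-1} \xrightarrow{p} \mathbf{A}^{-1}$; the fact that $\mathbf{Z}_N$ can be singular with vanishing
probability does not affect asymptotic analysis.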


3.3 Convergence in Distribution 


DEFINITION 3.6: A sequence of random variables $\{x_N: N = 1, 2, \ldots\}$ converges in
distribution to the continuous random variable $x$ if and only if


$F_N(\xi) \to F(\xi)$ as $N \to \infty$ for all $\xi \in \mathbb{R}$,


where $F_N$ is the cumulative distribution function (c.d.f.) of $x_N$ and $F$ is the (continuous)
c.d.f. of $x$. We write $x_N \xrightarrow{d} x$.


When $x \sim \mathrm{Normal}(\mu, \sigma^2)$, we write $x_N \xrightarrow{d} \mathrm{Normal}(\mu, \sigma^2)$ or $x_N \overset{a}{\sim} \mathrm{Normal}(\mu, \sigma^2)$
($x_N$ is asymptotically normal).

In Definition 3.6, $x_N$ is not required to be continuous for any $N$. A good example
of where $x_N$ is discrete for all $N$ but has an asymptotically normal distribution is
the De Moivre-Laplace theorem (a special case of the central limit theorem given in
Section 3.4), which says that $x_N \equiv (s_N - Np)/[Np(1 - p)]^{1/2}$ has a limiting standard
normal distribution, where $s_N$ has the binomial$(N, p)$ distribution.
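The De Moivre-Laplace statement can be checked directly, since the c.d.f. of the standardized binomial is computable exactly. The following sketch compares $F_N(\xi)$ with the standard normal c.d.f. at a point $\xi$; the function names and the particular values of $N$, $p$, and $\xi$ are illustrative assumptions. The binomial probabilities are computed on the log scale to avoid overflow for large $N$.

```python
import math


def normal_cdf(z: float) -> float:
    """Standard normal c.d.f. via the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def binomial_log_pmf(n: int, k: int, p: float) -> float:
    """log P(s_N = k) for s_N ~ Binomial(n, p), with 0 < p < 1."""
    log_comb = math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
    return log_comb + k * math.log(p) + (n - k) * math.log(1.0 - p)


def standardized_binomial_cdf(n: int, p: float, z: float) -> float:
    """P(x_N <= z), where x_N = (s_N - n p) / sqrt(n p (1 - p))."""
    cutoff = math.floor(n * p + z * math.sqrt(n * p * (1.0 - p)))
    if cutoff < 0:
        return 0.0
    cutoff = min(cutoff, n)
    return sum(math.exp(binomial_log_pmf(n, k, p)) for k in range(cutoff + 1))


if __name__ == "__main__":
    for n in (10, 100, 1000):
        print(n, standardized_binomial_cdf(n, 0.5, 1.0), normal_cdf(1.0))
```

Even though $x_N$ is discrete for every $N$, the gap between the two c.d.f. values should become small as $N$ grows.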


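DEFINITION 3.7: A sequence of $K \times 1$ random vectors $\{\mathbf{x}_N: N = 1, 2, \ldots\}$ converges
in distribution to the continuous random vector $\mathbf{x}$ if and only if for any $K \times 1$ nonrandom
vector $\mathbf{c}$ such that $\mathbf{c}'\mathbf{c} = 1$, $\mathbf{c}'\mathbf{x}_N \xrightarrow{d} \mathbf{c}'\mathbf{x}$, and we write $\mathbf{x}_N \xrightarrow{d} \mathbf{x}$.


When $\mathbf{x} \sim \mathrm{Normal}(\mathbf{m}, \mathbf{V})$, the requirement in Definition 3.7 is that $\mathbf{c}'\mathbf{x}_N \xrightarrow{d}
\mathrm{Normal}(\mathbf{c}'\mathbf{m}, \mathbf{c}'\mathbf{V}\mathbf{c})$ for every $\mathbf{c} \in \mathbb{R}^K$ such that $\mathbf{c}'\mathbf{c} = 1$; in this case we write $\mathbf{x}_N \xrightarrow{d}
\mathrm{Normal}(\mathbf{m}, \mathbf{V})$ or $\mathbf{x}_N \overset{a}{\sim} \mathrm{Normal}(\mathbf{m}, \mathbf{V})$. For the derivations in this book, $\mathbf{m} = \mathbf{0}$.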


LEMMA 3.5: If $\mathbf{x}_N \xrightarrow{d} \mathbf{x}$, where $\mathbf{x}$ is any $K \times 1$ random vector, then $\mathbf{x}_N = O_p(1)$.


As we will see throughout this book, Lemma 3.5 turns out to be very useful for 
establishing that a sequence is bounded in probability. Often it is easiest to first verify 
that a sequence converges in distribution. 


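LEMMA 3.6: Let $\{\mathbf{x}_N\}$ be a sequence of $K \times 1$ random vectors such that $\mathbf{x}_N \xrightarrow{d} \mathbf{x}$. If
$\mathbf{g}: \mathbb{R}^K \to \mathbb{R}^J$ is a continuous function, then $\mathbf{g}(\mathbf{x}_N) \xrightarrow{d} \mathbf{g}(\mathbf{x})$.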


The usefulness of Lemma 3.6, which is called the continuous mapping theorem, 
cannot be overstated. It tells us that once we know the limiting distribution of xy, we 
can find the limiting distribution of many interesting functions of xy. This is espe- 
cially useful for determining the asymptotic distribution of test statistics once the 
limiting distribution of an estimator is known; see Section 3.5. 

The continuity of g is not necessary in Lemma 3.6, but some restrictions are 
needed. We will need only the form stated in Lemma 3.6. 


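COROLLARY 3.2: If $\{\mathbf{z}_N\}$ is a sequence of $K \times 1$ random vectors such that $\mathbf{z}_N \xrightarrow{d}
\mathrm{Normal}(\mathbf{0}, \mathbf{V})$, then

(1) For any $K \times M$ nonrandom matrix $\mathbf{A}$, $\mathbf{A}'\mathbf{z}_N \xrightarrow{d} \mathrm{Normal}(\mathbf{0}, \mathbf{A}'\mathbf{V}\mathbf{A})$.

(2) $\mathbf{z}_N'\mathbf{V}^{-1}\mathbf{z}_N \xrightarrow{d} \chi^2_K$ (or $\mathbf{z}_N'\mathbf{V}^{-1}\mathbf{z}_N \overset{a}{\sim} \chi^2_K$).


LEMMA 3.7: Let $\{\mathbf{x}_N\}$ and $\{\mathbf{z}_N\}$ be sequences of $K \times 1$ random vectors. If $\mathbf{z}_N \xrightarrow{d} \mathbf{z}$
and $\mathbf{x}_N - \mathbf{z}_N \xrightarrow{p} \mathbf{0}$, then $\mathbf{x}_N \xrightarrow{d} \mathbf{z}$.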


Lemma 3.7 is called the asymptotic equivalence lemma. In Section 3.5.1 we discuss 
generally how Lemma 3.7 is used in econometrics. We use the asymptotic equiva- 
lence lemma so frequently in asymptotic analysis that after a while we will not even 
mention that we are using it. 


3.4 Limit Theorems for Random Samples 


In this section we state two classic limit theorems for independent, identically
distributed (i.i.d.) sequences of random vectors. These apply when sampling is done
randomly from a population.


THEOREM 3.1: Let $\{\mathbf{w}_i: i = 1, 2, \ldots\}$ be a sequence of independent, identically
distributed $G \times 1$ random vectors such that $E(|w_{ig}|) < \infty$, $g = 1, \ldots, G$. Then the
sequence satisfies the weak law of large numbers (WLLN): $N^{-1} \sum_{i=1}^{N} \mathbf{w}_i \xrightarrow{p} \boldsymbol{\mu}_w$, where
$\boldsymbol{\mu}_w \equiv E(\mathbf{w}_i)$.


THEOREM 3.2 (Lindeberg-Levy): Let $\{\mathbf{w}_i: i = 1, 2, \ldots\}$ be a sequence of independent,
identically distributed $G \times 1$ random vectors such that $E(w_{ig}^2) < \infty$,
$g = 1, \ldots, G$, and $E(\mathbf{w}_i) = \mathbf{0}$. Then $\{\mathbf{w}_i: i = 1, 2, \ldots\}$ satisfies the central limit theorem
(CLT); that is,


$N^{-1/2} \sum_{i=1}^{N} \mathbf{w}_i \xrightarrow{d} \mathrm{Normal}(\mathbf{0}, \mathbf{B})$

where $\mathbf{B} = \operatorname{Var}(\mathbf{w}_i) = E(\mathbf{w}_i \mathbf{w}_i')$ is necessarily positive semidefinite. For our purposes,
$\mathbf{B}$ is almost always positive definite.
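A quick simulation gives a feel for Theorem 3.2 in the scalar case. The sketch below is an illustration under assumptions made here (the Uniform$(-0.5, 0.5)$ population, the seed, and the replication count are arbitrary): it draws many replications of $N^{-1/2} \sum_{i=1}^{N} w_i$ and checks that their sample variance is close to $B = \operatorname{Var}(w_i) = 1/12$ and that roughly 95 percent fall within $1.96\sqrt{B}$ of zero.

```python
import math
import random
from typing import Tuple


def scaled_sum(n: int, rng: random.Random) -> float:
    """Return N^{-1/2} times the sum of n draws of w_i ~ Uniform(-0.5, 0.5)."""
    return sum(rng.random() - 0.5 for _ in range(n)) / math.sqrt(n)


def simulate_clt(n: int, reps: int, seed: int = 0) -> Tuple[float, float]:
    """Return (sample variance of scaled sums, share within 1.96 * sqrt(B) of 0)."""
    rng = random.Random(seed)
    b = 1.0 / 12.0  # Var(w_i) for Uniform(-0.5, 0.5)
    draws = [scaled_sum(n, rng) for _ in range(reps)]
    mean = sum(draws) / reps
    variance = sum((d - mean) ** 2 for d in draws) / (reps - 1)
    share = sum(abs(d) <= 1.96 * math.sqrt(b) for d in draws) / reps
    return variance, share


if __name__ == "__main__":
    print(simulate_clt(n=100, reps=2000))
```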


In most of this text, we will only need convergence results for independent, identi- 
cally distributed observations. Nevertheless, sometimes it will not make sense to as- 
sume identical distributions across i. (Cluster sampling and certain kinds of stratified 
sampling are two examples.) It is important to know that the WLLN and CLT con- 
tinue to hold for independent, not identically distributed (i.n.i.d.) observations under 
rather weak assumptions. See Problem 3.11 for a law of large numbers (which con- 
tains a condition that is not the weakest possible; White (2001) contains theorems 
under weaker conditions). Consistency and asymptotic normality arguments for re- 
gression and other estimators are more complicated but still fairly straightforward. 


3.5 Limiting Behavior of Estimators and Test Statistics 


In this section, we apply the previous concepts to sequences of estimators. Because 
estimators depend on the random outcomes of data, they are properly viewed as 
random vectors. 


3.5.1 Asymptotic Properties of Estimators 


DEFINITION 3.8: Let $\{\hat{\boldsymbol{\theta}}_N: N = 1, 2, \ldots\}$ be a sequence of estimators of the $P \times 1$
vector $\boldsymbol{\theta} \in \Theta$, where $N$ indexes the sample size. If


$\hat{\boldsymbol{\theta}}_N \xrightarrow{p} \boldsymbol{\theta}$    (3.2)


for any value of $\boldsymbol{\theta}$, then we say $\hat{\boldsymbol{\theta}}_N$ is a consistent estimator of $\boldsymbol{\theta}$.


Because there are other notions of convergence, in the theoretical literature condition
(3.2) is often referred to as weak consistency. This is the only kind of consistency
we will be concerned with, so we simply call condition (3.2) consistency. (See White
(2001, Chap. 2) for other kinds of convergence.) Since we do not know $\boldsymbol{\theta}$, the consistency
definition requires condition (3.2) for any possible value of $\boldsymbol{\theta}$.


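DEFINITION 3.9: Let $\{\hat{\boldsymbol{\theta}}_N: N = 1, 2, \ldots\}$ be a sequence of estimators of the $P \times 1$
vector $\boldsymbol{\theta} \in \Theta$. Suppose that


$\sqrt{N}(\hat{\boldsymbol{\theta}}_N - \boldsymbol{\theta}) \xrightarrow{d} \mathrm{Normal}(\mathbf{0}, \mathbf{V}),$    (3.3)


where $\mathbf{V}$ is a $P \times P$ positive semidefinite matrix. Then we say that $\hat{\boldsymbol{\theta}}_N$ is $\sqrt{N}$-asymptotically
normally distributed and $\mathbf{V}$ is the asymptotic variance of $\sqrt{N}(\hat{\boldsymbol{\theta}}_N - \boldsymbol{\theta})$,
denoted $\operatorname{Avar} \sqrt{N}(\hat{\boldsymbol{\theta}}_N - \boldsymbol{\theta}) = \mathbf{V}$.


Even though $\mathbf{V}/N = \operatorname{Var}(\hat{\boldsymbol{\theta}}_N)$ holds only in special cases, and $\hat{\boldsymbol{\theta}}_N$ rarely has an
exact normal distribution, we treat $\hat{\boldsymbol{\theta}}_N$ as if


$\hat{\boldsymbol{\theta}}_N \sim \mathrm{Normal}(\boldsymbol{\theta}, \mathbf{V}/N)$    (3.4)


whenever statement (3.3) holds. For this reason, $\mathbf{V}/N$ is called the asymptotic variance
of $\hat{\boldsymbol{\theta}}_N$, and we write


$\operatorname{Avar}(\hat{\boldsymbol{\theta}}_N) = \mathbf{V}/N.$    (3.5)


However, the only sense in which $\hat{\boldsymbol{\theta}}_N$ is approximately normally distributed with
mean $\boldsymbol{\theta}$ and variance $\mathbf{V}/N$ is contained in statement (3.3), and this is what is needed
to perform inference about $\boldsymbol{\theta}$. Statement (3.4) is a heuristic statement that leads to the
appropriate inference.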

When we discuss consistent estimation of asymptotic variances (a topic that will
arise often), we should technically focus on estimation of $\mathbf{V} \equiv \operatorname{Avar} \sqrt{N}(\hat{\boldsymbol{\theta}}_N - \boldsymbol{\theta})$. In
most cases, we will be able to find at least one, and usually more than one, consistent
estimator $\hat{\mathbf{V}}_N$ of $\mathbf{V}$. Then the corresponding estimator of $\operatorname{Avar}(\hat{\boldsymbol{\theta}}_N)$ is $\hat{\mathbf{V}}_N/N$, and we
write


$\widehat{\operatorname{Avar}}(\hat{\boldsymbol{\theta}}_N) = \hat{\mathbf{V}}_N/N.$    (3.6)


The division by $N$ in equation (3.6) is practically very important. What we call the
asymptotic variance of $\hat{\boldsymbol{\theta}}_N$ is estimated as in equation (3.6). Unfortunately, there has
not been a consistent usage of the term "asymptotic variance" in econometrics.
Taken literally, a statement such as "$\hat{\mathbf{V}}_N/N$ is consistent for $\operatorname{Avar}(\hat{\boldsymbol{\theta}}_N)$" is not very
meaningful because $\mathbf{V}/N$ converges to $\mathbf{0}$ as $N \to \infty$; typically, $\hat{\mathbf{V}}_N/N \xrightarrow{p} \mathbf{0}$ whether
or not $\hat{\mathbf{V}}_N$ is consistent for $\mathbf{V}$. Nevertheless, it is useful to have an admittedly
imprecise shorthand. In what follows, if we say that "$\hat{\mathbf{V}}_N/N$ consistently estimates
$\operatorname{Avar}(\hat{\boldsymbol{\theta}}_N)$," we mean that $\hat{\mathbf{V}}_N$ consistently estimates $\operatorname{Avar} \sqrt{N}(\hat{\boldsymbol{\theta}}_N - \boldsymbol{\theta})$.


DEFINITION 3.10: If $\sqrt{N}(\hat{\boldsymbol{\theta}}_N - \boldsymbol{\theta}) \overset{a}{\sim} \mathrm{Normal}(\mathbf{0}, \mathbf{V})$ where $\mathbf{V}$ is positive definite with
$j$th diagonal $v_{jj}$, and $\hat{\mathbf{V}}_N \xrightarrow{p} \mathbf{V}$, then the asymptotic standard error of $\hat{\theta}_{Nj}$, denoted
$\operatorname{se}(\hat{\theta}_{Nj})$, is $(\hat{v}_{Njj}/N)^{1/2}$.


In other words, the asymptotic standard error of an estimator, which is almost
always reported in applied work, is the square root of the appropriate diagonal element
of $\hat{\mathbf{V}}_N/N$. The asymptotic standard errors can be loosely thought of as estimating
the standard deviations of the elements of $\hat{\boldsymbol{\theta}}_N$, and they are the appropriate quantities
to use when forming (asymptotic) $t$ statistics and confidence intervals. Obtaining
valid asymptotic standard errors (after verifying that the estimator is asymptotically
normally distributed) is often the biggest challenge when using a new estimator.
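As a minimal sketch of Definition 3.10, the helper below turns the diagonal of an estimate $\hat{\mathbf{V}}_N$ into asymptotic standard errors by dividing by $N$ and taking square roots, and then forms the usual asymptotic 95 percent interval. The function names and the numbers in the usage lines are hypothetical and chosen only for illustration.

```python
import math
from typing import List, Tuple


def asymptotic_standard_errors(v_hat_diagonal: List[float], n: int) -> List[float]:
    """Return se_j = sqrt(v_hat_jj / N) for each diagonal element of V_hat."""
    if n <= 0:
        raise ValueError("sample size must be positive")
    return [math.sqrt(v_jj / n) for v_jj in v_hat_diagonal]


def confidence_interval_95(estimate: float, se: float) -> Tuple[float, float]:
    """Asymptotic 95% interval: estimate +/- 1.96 * se."""
    return estimate - 1.96 * se, estimate + 1.96 * se


if __name__ == "__main__":
    # Hypothetical inputs: diag(V_hat) = (4.0, 0.25) and N = 400.
    ses = asymptotic_standard_errors([4.0, 0.25], 400)
    print(ses)
    print(confidence_interval_95(1.0, ses[0]))
```

Forgetting the division by $N$ is the practical mistake the text warns about: the diagonal of $\hat{\mathbf{V}}_N$ itself estimates the variance of $\sqrt{N}(\hat{\boldsymbol{\theta}}_N - \boldsymbol{\theta})$, not of $\hat{\boldsymbol{\theta}}_N$.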

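If statement (3.3) holds, it follows by Lemma 3.5 that $\sqrt{N}(\hat{\boldsymbol{\theta}}_N - \boldsymbol{\theta}) = O_p(1)$, or
$\hat{\boldsymbol{\theta}}_N - \boldsymbol{\theta} = O_p(N^{-1/2})$, and we say that $\hat{\boldsymbol{\theta}}_N$ is a $\sqrt{N}$-consistent estimator of $\boldsymbol{\theta}$.
$\sqrt{N}$-consistency certainly implies that $\operatorname{plim} \hat{\boldsymbol{\theta}}_N = \boldsymbol{\theta}$, but it is much stronger because it tells
us that the rate of convergence is almost the square root of the sample size $N$:
$\hat{\boldsymbol{\theta}}_N - \boldsymbol{\theta} = o_p(N^{-c})$ for any $0 \le c < 1/2$. In this book, almost every consistent estimator
we will study (and every one we consider in any detail) is $\sqrt{N}$-asymptotically normal,
and therefore $\sqrt{N}$-consistent, under reasonable assumptions.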

If one $\sqrt{N}$-asymptotically normal estimator has an asymptotic variance that is
smaller than another's asymptotic variance (in the matrix sense), it makes it easy to
choose between the estimators based on asymptotic considerations.


DEFINITION 3.11: Let $\hat{\boldsymbol{\theta}}_N$ and $\tilde{\boldsymbol{\theta}}_N$ be estimators of $\boldsymbol{\theta}$ each satisfying statement (3.3),
with asymptotic variances $\mathbf{V} = \operatorname{Avar} \sqrt{N}(\hat{\boldsymbol{\theta}}_N - \boldsymbol{\theta})$ and $\mathbf{D} = \operatorname{Avar} \sqrt{N}(\tilde{\boldsymbol{\theta}}_N - \boldsymbol{\theta})$ (these
generally depend on the value of $\boldsymbol{\theta}$, but we suppress that consideration here). Then
(1) $\hat{\boldsymbol{\theta}}_N$ is asymptotically efficient relative to $\tilde{\boldsymbol{\theta}}_N$ if $\mathbf{D} - \mathbf{V}$ is positive semidefinite for
all $\boldsymbol{\theta}$;
(2) $\hat{\boldsymbol{\theta}}_N$ and $\tilde{\boldsymbol{\theta}}_N$ are $\sqrt{N}$-equivalent if $\sqrt{N}(\hat{\boldsymbol{\theta}}_N - \tilde{\boldsymbol{\theta}}_N) = o_p(1)$.


When two estimators are $\sqrt{N}$-equivalent, they have the same limiting distribution
(multivariate normal in this case, with the same asymptotic variance). This conclusion
follows immediately from the asymptotic equivalence lemma (Lemma 3.7).
Sometimes, to find the limiting distribution of, say, $\sqrt{N}(\hat{\boldsymbol{\theta}}_N - \boldsymbol{\theta})$, it is easiest to first
find the limiting distribution of $\sqrt{N}(\tilde{\boldsymbol{\theta}}_N - \boldsymbol{\theta})$, and then to show that $\hat{\boldsymbol{\theta}}_N$ and $\tilde{\boldsymbol{\theta}}_N$ are
$\sqrt{N}$-equivalent. A good example of this approach is in Chapter 7, where we find the
limiting distribution of the feasible generalized least squares estimator, after we have
found the limiting distribution of the GLS estimator.


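DEFINITION 3.12: Partition $\hat{\boldsymbol{\theta}}_N$ satisfying statement (3.3) into vectors $\hat{\boldsymbol{\theta}}_{N1}$ and $\hat{\boldsymbol{\theta}}_{N2}$.
Then $\hat{\boldsymbol{\theta}}_{N1}$ and $\hat{\boldsymbol{\theta}}_{N2}$ are asymptotically independent if


$\mathbf{V} = \begin{pmatrix} \mathbf{V}_1 & \mathbf{0} \\ \mathbf{0} & \mathbf{V}_2 \end{pmatrix},$

where $\mathbf{V}_1$ is the asymptotic variance of $\sqrt{N}(\hat{\boldsymbol{\theta}}_{N1} - \boldsymbol{\theta}_1)$ and similarly for $\mathbf{V}_2$. In other
words, the asymptotic variance of $\sqrt{N}(\hat{\boldsymbol{\theta}}_N - \boldsymbol{\theta})$ is block diagonal.


Throughout this section we have been careful to index estimators by the sample
size, $N$. This is useful to fix ideas on the nature of asymptotic analysis, but it is cumbersome
when applying asymptotics to particular estimation methods. After this
chapter, an estimator of $\boldsymbol{\theta}$ will be denoted $\hat{\boldsymbol{\theta}}$, which is understood to depend on the
sample size $N$. When we write, for example, $\hat{\boldsymbol{\theta}} \xrightarrow{p} \boldsymbol{\theta}$, we mean convergence in probability
as the sample size $N$ goes to infinity.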


3.5.2 Asymptotic Properties of Test Statistics 
We begin with some important definitions in the large-sample analysis of test statistics. 


DEFINITION 3.13: (1) The asymptotic size of a testing procedure is defined as the
limiting probability of rejecting $H_0$ when $H_0$ is true. Mathematically, we can write
this as $\lim_{N \to \infty} P_N(\text{reject } H_0 \mid H_0)$, where the $N$ subscript indexes the sample size.

(2) A test is said to be consistent against the alternative $H_1$ if the null hypothesis
is rejected with probability approaching one when $H_1$ is true: $\lim_{N \to \infty} P_N(\text{reject }
H_0 \mid H_1) = 1$.


In practice, the asymptotic size of a test is obtained by finding the limiting distribution
of a test statistic (in our case, normal or chi-square, or simple modifications of
these that can be used as $t$ distributed or $F$ distributed) and then choosing a critical
value based on this distribution. Thus, testing using asymptotic methods is practically
the same as testing using the classical linear model.

A test is consistent against alternative $H_1$ if the probability of rejecting $H_0$ tends to
unity as the sample size grows without bound. Just as consistency of an estimator is a
minimal requirement, so is consistency of a test statistic. Consistency rarely allows us
to choose among tests: most tests are consistent against alternatives that they are
supposed to have power against. For consistent tests with the same asymptotic size,
we can use the notion of local power analysis to choose among tests. We will cover
this briefly in Chapter 12 on nonlinear estimation, where we introduce the notion of
local alternatives, that is, alternatives to $H_0$ that converge to $H_0$ at rate $1/\sqrt{N}$.
Generally, test statistics will have desirable asymptotic properties when they are based
on estimators with good asymptotic properties (such as efficiency). We now derive the
limiting distribution of a test statistic that is used very often in econometrics.


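LEMMA 3.8: Suppose that statement (3.3) holds, where $\mathbf{V}$ is positive definite. Then
for any nonstochastic $Q \times P$ matrix $\mathbf{R}$, $Q \le P$, with $\operatorname{rank}(\mathbf{R}) = Q$,


$\sqrt{N}\mathbf{R}(\hat{\boldsymbol{\theta}}_N - \boldsymbol{\theta}) \overset{a}{\sim} \mathrm{Normal}(\mathbf{0}, \mathbf{R}\mathbf{V}\mathbf{R}')$

and

$[\sqrt{N}\mathbf{R}(\hat{\boldsymbol{\theta}}_N - \boldsymbol{\theta})]'[\mathbf{R}\mathbf{V}\mathbf{R}']^{-1}[\sqrt{N}\mathbf{R}(\hat{\boldsymbol{\theta}}_N - \boldsymbol{\theta})] \overset{a}{\sim} \chi^2_Q.$

In addition, if $\operatorname{plim} \hat{\mathbf{V}}_N = \mathbf{V}$ then

$[\sqrt{N}\mathbf{R}(\hat{\boldsymbol{\theta}}_N - \boldsymbol{\theta})]'[\mathbf{R}\hat{\mathbf{V}}_N\mathbf{R}']^{-1}[\sqrt{N}\mathbf{R}(\hat{\boldsymbol{\theta}}_N - \boldsymbol{\theta})]$
$= (\hat{\boldsymbol{\theta}}_N - \boldsymbol{\theta})'\mathbf{R}'[\mathbf{R}(\hat{\mathbf{V}}_N/N)\mathbf{R}']^{-1}\mathbf{R}(\hat{\boldsymbol{\theta}}_N - \boldsymbol{\theta}) \overset{a}{\sim} \chi^2_Q.$

For testing the null hypothesis $H_0: \mathbf{R}\boldsymbol{\theta} = \mathbf{r}$, where $\mathbf{r}$ is a $Q \times 1$ nonrandom vector,
define the Wald statistic for testing $H_0$ against $H_1: \mathbf{R}\boldsymbol{\theta} \ne \mathbf{r}$ as

$W_N \equiv (\mathbf{R}\hat{\boldsymbol{\theta}}_N - \mathbf{r})'[\mathbf{R}(\hat{\mathbf{V}}_N/N)\mathbf{R}']^{-1}(\mathbf{R}\hat{\boldsymbol{\theta}}_N - \mathbf{r}).$    (3.7)


Under $H_0$, $W_N \overset{a}{\sim} \chi^2_Q$. If we abuse the asymptotics and treat $\hat{\boldsymbol{\theta}}_N$ as being distributed
as $\mathrm{Normal}(\boldsymbol{\theta}, \hat{\mathbf{V}}_N/N)$, we get equation (3.7) exactly.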
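For a single linear restriction ($Q = 1$), the Wald statistic in equation (3.7) reduces to a squared difference divided by a scalar variance, which is easy to compute by hand or in code. The sketch below handles only that special case; the function names and all numbers in the usage lines are hypothetical. The p-value uses the fact that a $\chi^2_1$ variable is the square of a standard normal.

```python
import math
from typing import List


def wald_single_restriction(
    theta_hat: List[float],
    v_hat: List[List[float]],
    n: int,
    r_row: List[float],
    r_value: float,
) -> float:
    """W_N for H0: R theta = r when R is a single 1 x P row (Q = 1)."""
    p = len(theta_hat)
    gap = sum(r_row[j] * theta_hat[j] for j in range(p)) - r_value
    # R (V_hat / N) R' is a scalar when Q = 1.
    quad = sum(r_row[j] * v_hat[j][k] * r_row[k] for j in range(p) for k in range(p)) / n
    return gap * gap / quad


def chi2_one_df_pvalue(w: float) -> float:
    """P(chi-square with 1 d.f. > w), using chi2_1 = Z^2 for standard normal Z."""
    return math.erfc(math.sqrt(w / 2.0))


if __name__ == "__main__":
    # Hypothetical numbers: test H0: theta_1 - theta_2 = 0 with N = 100.
    w = wald_single_restriction(
        theta_hat=[1.2, 0.8],
        v_hat=[[2.0, 0.5], [0.5, 1.0]],
        n=100,
        r_row=[1.0, -1.0],
        r_value=0.0,
    )
    print(w, chi2_one_df_pvalue(w))
```

With these inputs, $\mathbf{R}\hat{\boldsymbol{\theta}}_N - r = 0.4$ and $\mathbf{R}(\hat{\mathbf{V}}_N/N)\mathbf{R}' = 0.02$, so $W_N = 0.16/0.02 = 8$, which exceeds the 5 percent $\chi^2_1$ critical value of about 3.84. For $Q > 1$ a matrix inverse is needed, which this sketch deliberately omits.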


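LEMMA 3.9: Suppose that statement (3.3) holds, where $\mathbf{V}$ is positive definite. Let $\mathbf{c}: \Theta
\to \mathbb{R}^Q$ be a continuously differentiable function on the parameter space $\Theta \subset \mathbb{R}^P$,
where $Q \le P$, and assume that $\boldsymbol{\theta}$ is in the interior of the parameter space. Define
$\mathbf{C}(\boldsymbol{\theta}) \equiv \nabla_{\theta}\mathbf{c}(\boldsymbol{\theta})$ as the $Q \times P$ Jacobian of $\mathbf{c}$. Then


$\sqrt{N}[\mathbf{c}(\hat{\boldsymbol{\theta}}_N) - \mathbf{c}(\boldsymbol{\theta})] \overset{a}{\sim} \mathrm{Normal}[\mathbf{0}, \mathbf{C}(\boldsymbol{\theta})\mathbf{V}\mathbf{C}(\boldsymbol{\theta})']$    (3.8)

and

$\{\sqrt{N}[\mathbf{c}(\hat{\boldsymbol{\theta}}_N) - \mathbf{c}(\boldsymbol{\theta})]\}'[\mathbf{C}(\boldsymbol{\theta})\mathbf{V}\mathbf{C}(\boldsymbol{\theta})']^{-1}\{\sqrt{N}[\mathbf{c}(\hat{\boldsymbol{\theta}}_N) - \mathbf{c}(\boldsymbol{\theta})]\} \overset{a}{\sim} \chi^2_Q.$

Define $\hat{\mathbf{C}}_N \equiv \mathbf{C}(\hat{\boldsymbol{\theta}}_N)$. Then $\operatorname{plim} \hat{\mathbf{C}}_N = \mathbf{C}(\boldsymbol{\theta})$. If $\operatorname{plim} \hat{\mathbf{V}}_N = \mathbf{V}$, then

$\{\sqrt{N}[\mathbf{c}(\hat{\boldsymbol{\theta}}_N) - \mathbf{c}(\boldsymbol{\theta})]\}'[\hat{\mathbf{C}}_N\hat{\mathbf{V}}_N\hat{\mathbf{C}}_N']^{-1}\{\sqrt{N}[\mathbf{c}(\hat{\boldsymbol{\theta}}_N) - \mathbf{c}(\boldsymbol{\theta})]\} \overset{a}{\sim} \chi^2_Q.$    (3.9)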


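Equation (3.8) is very useful for obtaining asymptotic standard errors for nonlinear
functions of $\hat{\boldsymbol{\theta}}_N$. The appropriate estimator of $\operatorname{Avar}[\mathbf{c}(\hat{\boldsymbol{\theta}}_N)]$ is $\hat{\mathbf{C}}_N(\hat{\mathbf{V}}_N/N)\hat{\mathbf{C}}_N' =
\hat{\mathbf{C}}_N[\widehat{\operatorname{Avar}}(\hat{\boldsymbol{\theta}}_N)]\hat{\mathbf{C}}_N'$. Thus, once $\widehat{\operatorname{Avar}}(\hat{\boldsymbol{\theta}}_N)$ and the estimated Jacobian of $\mathbf{c}$ are obtained,
we can easily obtain


$\widehat{\operatorname{Avar}}[\mathbf{c}(\hat{\boldsymbol{\theta}}_N)] = \hat{\mathbf{C}}_N[\widehat{\operatorname{Avar}}(\hat{\boldsymbol{\theta}}_N)]\hat{\mathbf{C}}_N'.$    (3.10)


The asymptotic standard errors are obtained as the square roots of the diagonal
elements of equation (3.10). In the scalar case $\hat{\gamma}_N = c(\hat{\boldsymbol{\theta}}_N)$, the asymptotic standard
error of $\hat{\gamma}_N$ is $[\nabla_{\theta}c(\hat{\boldsymbol{\theta}}_N)[\widehat{\operatorname{Avar}}(\hat{\boldsymbol{\theta}}_N)]\nabla_{\theta}c(\hat{\boldsymbol{\theta}}_N)']^{1/2}$.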
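The scalar formula above translates almost line for line into code. The following sketch computes a delta-method standard error for a scalar function $c(\hat{\boldsymbol{\theta}}_N)$; as a convenience that is an assumption of this sketch rather than anything in the text, the gradient is approximated by central differences instead of being derived analytically. The function names, step size, and example numbers are hypothetical.

```python
import math
from typing import Callable, List, Sequence


def numerical_gradient(
    c: Callable[[Sequence[float]], float], theta: Sequence[float], h: float = 1e-6
) -> List[float]:
    """Central-difference approximation to the 1 x P gradient of c at theta."""
    grad = []
    for j in range(len(theta)):
        up, down = list(theta), list(theta)
        up[j] += h
        down[j] -= h
        grad.append((c(up) - c(down)) / (2.0 * h))
    return grad


def delta_method_se(
    c: Callable[[Sequence[float]], float],
    theta_hat: Sequence[float],
    avar_hat: Sequence[Sequence[float]],
) -> float:
    """sqrt( grad * Avar_hat(theta_hat) * grad' ), where Avar_hat is already V_hat / N."""
    g = numerical_gradient(c, theta_hat)
    p = len(theta_hat)
    variance = sum(g[j] * avar_hat[j][k] * g[k] for j in range(p) for k in range(p))
    return math.sqrt(variance)


if __name__ == "__main__":
    # Hypothetical example: c(theta) = theta_1 * theta_2 at theta_hat = (2, 3).
    se = delta_method_se(
        lambda t: t[0] * t[1],
        [2.0, 3.0],
        [[0.04, 0.01], [0.01, 0.09]],
    )
    print(se)
```

Here the gradient is $(3, 2)$, so the estimated variance is $9(0.04) + 2(3)(2)(0.01) + 4(0.09) = 0.84$ and the standard error is $\sqrt{0.84} \approx 0.917$. When an analytic Jacobian is available it is usually preferable to the numerical approximation.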

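Equation (3.9) is useful for testing nonlinear hypotheses of the form $H_0: \mathbf{c}(\boldsymbol{\theta}) = \mathbf{0}$
against $H_1: \mathbf{c}(\boldsymbol{\theta}) \ne \mathbf{0}$. The Wald statistic is


$W_N = \sqrt{N}\mathbf{c}(\hat{\boldsymbol{\theta}}_N)'[\hat{\mathbf{C}}_N\hat{\mathbf{V}}_N\hat{\mathbf{C}}_N']^{-1}\sqrt{N}\mathbf{c}(\hat{\boldsymbol{\theta}}_N) = \mathbf{c}(\hat{\boldsymbol{\theta}}_N)'[\hat{\mathbf{C}}_N(\hat{\mathbf{V}}_N/N)\hat{\mathbf{C}}_N']^{-1}\mathbf{c}(\hat{\boldsymbol{\theta}}_N).$    (3.11)


Under $H_0$, $W_N \overset{a}{\sim} \chi^2_Q$.

The method of establishing equation (3.8), given that statement (3.3) holds, is often
called the delta method, and it is used very often in econometrics. It gets its name
from its use of calculus. The argument is as follows. Because $\boldsymbol{\theta}$ is in the interior of $\Theta$,
and because $\operatorname{plim} \hat{\boldsymbol{\theta}}_N = \boldsymbol{\theta}$, $\hat{\boldsymbol{\theta}}_N$ is in an open, convex subset of $\Theta$ containing $\boldsymbol{\theta}$ with
probability approaching one; therefore, w.p.a.1 we can use a mean value expansion
$\mathbf{c}(\hat{\boldsymbol{\theta}}_N) = \mathbf{c}(\boldsymbol{\theta}) + \ddot{\mathbf{C}}_N \cdot (\hat{\boldsymbol{\theta}}_N - \boldsymbol{\theta})$, where $\ddot{\mathbf{C}}_N$ denotes the matrix $\mathbf{C}(\boldsymbol{\theta})$ with rows evaluated
at mean values between $\hat{\boldsymbol{\theta}}_N$ and $\boldsymbol{\theta}$. Because these mean values are trapped between
$\hat{\boldsymbol{\theta}}_N$ and $\boldsymbol{\theta}$, they converge in probability to $\boldsymbol{\theta}$. Therefore, by Slutsky's theorem,
$\ddot{\mathbf{C}}_N \xrightarrow{p} \mathbf{C}(\boldsymbol{\theta})$, and we can write


$\sqrt{N}[\mathbf{c}(\hat{\boldsymbol{\theta}}_N) - \mathbf{c}(\boldsymbol{\theta})] = \ddot{\mathbf{C}}_N \cdot \sqrt{N}(\hat{\boldsymbol{\theta}}_N - \boldsymbol{\theta})$
$= \mathbf{C}(\boldsymbol{\theta})\sqrt{N}(\hat{\boldsymbol{\theta}}_N - \boldsymbol{\theta}) + [\ddot{\mathbf{C}}_N - \mathbf{C}(\boldsymbol{\theta})]\sqrt{N}(\hat{\boldsymbol{\theta}}_N - \boldsymbol{\theta})$
$= \mathbf{C}(\boldsymbol{\theta})\sqrt{N}(\hat{\boldsymbol{\theta}}_N - \boldsymbol{\theta}) + o_p(1) \cdot O_p(1) = \mathbf{C}(\boldsymbol{\theta})\sqrt{N}(\hat{\boldsymbol{\theta}}_N - \boldsymbol{\theta}) + o_p(1).$


We can now apply the asymptotic equivalence lemma (Lemma 3.7) and Lemma 3.8
[with $\mathbf{R} = \mathbf{C}(\boldsymbol{\theta})$] to get equation (3.8).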


Problems 


3.1. Prove Lemma 3.1. 

3.2. Using Lemma 3.2, prove Lemma 3.3. 

3.3. Explain why, under the assumptions of Lemma 3.4, $\mathbf{g}(\mathbf{x}_N) = O_p(1)$.

3.4. Prove Corollary 3.2.

3.5. Let $\{y_i: i = 1, 2, \ldots\}$ be an independent, identically distributed sequence with
$E(y_i^2) < \infty$. Let $\mu = E(y_i)$ and $\sigma^2 = \operatorname{Var}(y_i)$.

a. Let $\bar{y}_N$ denote the sample average based on a sample size of $N$. Find
$\operatorname{Var}[\sqrt{N}(\bar{y}_N - \mu)]$.

b. What is the asymptotic variance of $\sqrt{N}(\bar{y}_N - \mu)$?

c. What is the asymptotic variance of $\bar{y}_N$? Compare this with $\operatorname{Var}(\bar{y}_N)$.

d. What is the asymptotic standard deviation of $\bar{y}_N$?

e. How would you obtain the asymptotic standard error of $\bar{y}_N$?


3.6. Give a careful (albeit short) proof of the following statement: If $\sqrt{N}(\hat{\boldsymbol{\theta}}_N - \boldsymbol{\theta}) =
O_p(1)$, then $\hat{\boldsymbol{\theta}}_N - \boldsymbol{\theta} = o_p(N^{-c})$ for any $0 \le c < 1/2$.

3.7. Let $\hat{\theta}$ be a $\sqrt{N}$-asymptotically normal estimator for the scalar $\theta > 0$. Let
$\hat{\gamma} = \log(\hat{\theta})$ be an estimator of $\gamma = \log(\theta)$.

a. Why is $\hat{\gamma}$ a consistent estimator of $\gamma$?

b. Find the asymptotic variance of $\sqrt{N}(\hat{\gamma} - \gamma)$ in terms of the asymptotic variance of
$\sqrt{N}(\hat{\theta} - \theta)$.

c. Suppose that, for a sample of data, $\hat{\theta} = 4$ and $\operatorname{se}(\hat{\theta}) = 2$. What is $\hat{\gamma}$ and its
(asymptotic) standard error?

d. Consider the null hypothesis $H_0: \theta = 1$. What is the asymptotic $t$ statistic for
testing $H_0$, given the numbers from part c?

e. Now state $H_0$ from part d equivalently in terms of $\gamma$, and use $\hat{\gamma}$ and $\operatorname{se}(\hat{\gamma})$ to test
$H_0$. What do you conclude?


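3.8. Let $\hat{\boldsymbol{\theta}} = (\hat{\theta}_1, \hat{\theta}_2)'$ be a $\sqrt{N}$-asymptotically normal estimator for $\boldsymbol{\theta} = (\theta_1, \theta_2)'$,
with $\theta_2 \ne 0$. Let $\hat{\gamma} = \hat{\theta}_1/\hat{\theta}_2$ be an estimator of $\gamma = \theta_1/\theta_2$.

a. Show that $\operatorname{plim} \hat{\gamma} = \gamma$.

b. Find $\operatorname{Avar}(\hat{\gamma})$ in terms of $\boldsymbol{\theta}$ and $\operatorname{Avar}(\hat{\boldsymbol{\theta}})$ using the delta method.

c. If, for a sample of data, $\hat{\boldsymbol{\theta}} = (-1.5, .5)'$ and $\operatorname{Avar}(\hat{\boldsymbol{\theta}})$ is estimated as
$\begin{pmatrix} 1 & -.4 \\ -.4 & 2 \end{pmatrix}$, find the asymptotic standard error of $\hat{\gamma}$.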


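3.9. Let $\hat{\boldsymbol{\theta}}$ and $\tilde{\boldsymbol{\theta}}$ be two consistent, $\sqrt{N}$-asymptotically normal estimators of the
$P \times 1$ parameter vector $\boldsymbol{\theta}$, with $\operatorname{Avar} \sqrt{N}(\hat{\boldsymbol{\theta}} - \boldsymbol{\theta}) = \mathbf{V}_1$ and $\operatorname{Avar} \sqrt{N}(\tilde{\boldsymbol{\theta}} - \boldsymbol{\theta}) = \mathbf{V}_2$.
Define a $Q \times 1$ parameter vector by $\boldsymbol{\gamma} = \mathbf{g}(\boldsymbol{\theta})$, where $\mathbf{g}(\cdot)$ is a continuously differentiable
function. Show that, if $\hat{\boldsymbol{\theta}}$ is asymptotically more efficient than $\tilde{\boldsymbol{\theta}}$, then $\hat{\boldsymbol{\gamma}} \equiv
\mathbf{g}(\hat{\boldsymbol{\theta}})$ is asymptotically efficient relative to $\tilde{\boldsymbol{\gamma}} \equiv \mathbf{g}(\tilde{\boldsymbol{\theta}})$.

3.10. Let $\{w_i: i = 1, 2, \ldots\}$ be a sequence of independent, identically distributed
random variables with $E(w_i^2) < \infty$ and $E(w_i) = 0$. Define the standardized partial
sum as $x_N = N^{-1/2} \sum_{i=1}^{N} w_i$. Use Chebyshev's inequality (see, for example, Casella
and Berger (2002, p. 122)) to prove that $x_N = O_p(1)$. Therefore, we can show directly
that a standardized partial sum of i.i.d. random variables with finite second moment
is $O_p(1)$, rather than having to appeal to the central limit theorem.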


3.11. Let $\{w_i: i = 1, 2, \ldots\}$ be a sequence of independent, not (necessarily) identically
distributed random variables with $E(w_i^2) < \infty$, $i = 1, 2, \ldots$, and $E(w_i) = \mu_i$,
$i = 1, 2, \ldots$. For each $i$, define $\sigma_i^2 = \operatorname{Var}(w_i)$.

a. Use Chebyshev's inequality to show that a sufficient condition for
$N^{-1} \sum_{i=1}^{N} (w_i - \mu_i) \xrightarrow{p} 0$ is $N^{-2} \sum_{i=1}^{N} \sigma_i^2 \to 0$ as $N \to \infty$.

b. Is the condition from part a satisfied if $\sigma_i^2 < b < \infty$, $i = 1, 2, \ldots$? (See White
(2001) for weaker conditions under which the WLLN holds for i.n.i.d. sequences.)


II LINEAR MODELS 


In Part II we begin our econometric analysis of linear models for cross section and 
panel data. In Chapter 4 we review the single-equation linear model and discuss 
ordinary least squares estimation. Although this material is, in principle, review, the 
approach is likely to be different from an introductory linear models course. In ad- 
dition, we cover several topics that are not traditionally covered in texts but that have 
proven useful in empirical work. Chapter 5 discusses instrumental variables estima- 
tion of the linear model, and Chapter 6 covers some remaining topics to round out 
our treatment of the single-equation model. 

Chapter 7 begins our analysis of systems of equations. The general setup is that the 
number of population equations is small relative to the (cross section) sample size. 
This allows us to cover seemingly unrelated regression models for cross section data 
as well as begin our analysis of panel data. Chapter 8 builds on the framework from 
Chapter 7 but considers the case where some explanatory variables may be correlated
with the error terms. Generalized method of moments estimation is the unifying
theme. Chapter 9 applies the methods of Chapter 8 to the estimation of simultaneous 
equations models, with an emphasis on the conceptual issues that arise in applying 
such models. 

Chapter 10 explicitly introduces unobserved-effects linear panel data models. Under 
the assumption that the explanatory variables are strictly exogenous conditional on 
the unobserved effect, we study several estimation methods, including fixed effects, 
first differencing, and random effects. The last method assumes, at a minimum, 
that the unobserved effect is uncorrelated with the explanatory variables in all time 
periods. Chapter 11 considers extensions of the basic panel data model, including 
failure of the strict exogeneity assumption and models with individual-specific slopes. 


4 Single-Equation Linear Model and Ordinary Least Squares Estimation 


4.1 Overview of the Single-Equation Linear Model 


This and the next couple of chapters cover what is still the workhorse in empirical 
economics: the single-equation linear model. Though you are assumed to be com- 
fortable with ordinary least squares (OLS) estimation, we begin with OLS for a 
couple of reasons. First, it provides a bridge between more traditional approaches 
to econometrics, which treat explanatory variables as fixed, and the current ap- 
proach, which is based on random sampling with stochastic explanatory variables. 
Second, we cover some topics that receive at best cursory treatment in first-semester 
texts. These topics, such as proxy variable solutions to the omitted variable problem, 
arise often in applied work. 
The population model we study is linear in its parameters,


$y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_K x_K + u,$    (4.1)


where $y, x_1, x_2, x_3, \ldots, x_K$ are observable random scalars (that is, we can observe
them in a random sample of the population), $u$ is the unobservable random disturbance
or error, and $\beta_0, \beta_1, \beta_2, \ldots, \beta_K$ are the parameters (constants) we would like to
estimate.

The error form of the model in equation (4.1) is useful for presenting a unified 
treatment of the statistical properties of various econometric procedures. Neverthe- 
less, the steps one uses for getting to equation (4.1) are just as important. Goldberger 
(1972) defines a structural model as one representing a causal relationship, as opposed 
to a relationship that simply captures statistical associations. A structural equation 
can be obtained from an economic model, or it can be obtained through informal 
reasoning. Sometimes the structural model is directly estimable. Other times we must 
combine auxiliary assumptions about other variables with algebraic manipulations 
to arrive at an estimable model. In addition, we will often have reasons to estimate 
nonstructural equations, sometimes as a precursor to estimating a structural equation. 

The error term $u$ can consist of a variety of things, including omitted variables
and measurement error (we will see some examples shortly). The parameters $\beta_j$
hopefully correspond to the parameters of interest, that is, the parameters in an underlying
structural model. Whether this is the case depends on the application and the
assumptions made.

As we will see in Section 4.2, the key condition needed for OLS to consistently
estimate the $\beta_j$ (assuming we have available a random sample from the population) is
that the error (in the population) has mean zero and is uncorrelated with each of the
regressors:


$E(u) = 0, \quad \operatorname{Cov}(x_j, u) = 0, \quad j = 1, 2, \ldots, K.$    (4.2)


The zero-mean assumption is for free when an intercept is included, and we will
restrict attention to that case in what follows. It is the zero covariance of $u$ with each
$x_j$ that is important. From Chapter 2 we know that equation (4.1) and assumption
(4.2) are equivalent to defining the linear projection of $y$ onto $(1, x_1, x_2, \ldots, x_K)$ as
$\beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_K x_K$.

Sufficient for assumption (4.2) is the zero conditional mean assumption


$E(u \mid x_1, x_2, \ldots, x_K) = E(u \mid \mathbf{x}) = 0.$    (4.3)


Under equation (4.1) and assumption (4.3), we have the population regression
function


$E(y \mid x_1, x_2, \ldots, x_K) = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_K x_K.$    (4.4)


As we saw in Chapter 2, equation (4.4) includes the case where the $x_j$ are nonlinear
functions of underlying explanatory variables, such as


$E(savings \mid income, size, age, college) = \beta_0 + \beta_1 \log(income) + \beta_2 size + \beta_3 age
+ \beta_4 college + \beta_5 college \cdot age.$


We will study the asymptotic properties of OLS primarily under assumption (4.2), 
because it is weaker than assumption (4.3). As we discussed in Chapter 2, assumption 
(4.3) is natural when a structural model is directly estimable because it ensures that 
no additional functions of the explanatory variables help to explain y. 

An explanatory variable $x_j$ is said to be endogenous in equation (4.1) if it is correlated
with $u$. You should not rely too much on the meaning of "endogenous" from
other branches of economics. In traditional usage, a variable is endogenous if it is
determined within the context of a model. The usage in econometrics, while related to
traditional definitions, has evolved to describe any situation where an explanatory
variable is correlated with the disturbance. If $x_j$ is uncorrelated with $u$, then $x_j$ is said
to be exogenous in equation (4.1). If assumption (4.3) holds, then each explanatory
variable is necessarily exogenous.

In applied econometrics, endogeneity usually arises in one of three ways: 


Omitted Variables Omitted variables are an issue when we would like to control
for one or more additional variables but, usually because of data unavailability,
we cannot include them in a regression model. Specifically, suppose that $E(y \mid \mathbf{x}, q)$ is
the conditional expectation of interest, which can be written as a function linear in
parameters and additive in $q$. If $q$ is unobserved, we can always estimate $E(y \mid \mathbf{x})$, but
this need have no particular relationship to $E(y \mid \mathbf{x}, q)$ when $q$ and $\mathbf{x}$ are allowed to be
correlated. One way to represent this situation is to write equation (4.1) where $q$ is
part of the error term $u$. If $q$ and $x_j$ are correlated, then $x_j$ is endogenous. The correlation
of explanatory variables with unobservables is often due to self-selection: if
agents choose the value of $x_j$, this might depend on factors ($q$) that are unobservable
to the analyst. A good example is omitted ability in a wage equation, where an individual's
years of schooling are likely to be correlated with unobserved ability. We
discuss the omitted variables problem in detail in Section 4.3.


Measurement Error In this case we would like to measure the (partial) effect of a
variable, say $x_K^*$, but we can observe only an imperfect measure of it, say $x_K$. When
we plug $x_K$ in for $x_K^*$ (thereby arriving at the estimable equation (4.1)) we necessarily
put a measurement error into $u$. Depending on assumptions about how $x_K^*$
and $x_K$ are related, $u$ and $x_K$ may or may not be correlated. For example, $x_K^*$ might
denote a marginal tax rate, but we can only obtain data on the average tax rate. We
will study the measurement error problem in Section 4.4.


Simultaneity Simultaneity arises when at least one of the explanatory variables is
determined simultaneously along with y. If, say, $x_K$ is determined partly as a function
of y, then $x_K$ and u are generally correlated. For example, if y is city murder rate
and $x_K$ is size of the police force, size of the police force is partly determined by the
murder rate. Conceptually, this is a more difficult situation to analyze, because we
must be able to think of a situation where we could vary $x_K$ exogenously, even though
in the data that we collect y and $x_K$ are generated simultaneously. Chapter 9 treats
simultaneous equations models in detail.


The distinctions among the three possible forms of endogeneity are not always 
sharp. In fact, an equation can have more than one source of endogeneity. For ex- 
ample, in looking at the effect of alcohol consumption on worker productivity (as 
typically measured by wages), we would worry that alcohol usage is correlated with 
unobserved factors, possibly related to family background, that also affect wage; this 
is an omitted variables problem. In addition, alcohol demand would generally de- 
pend on income, which is largely determined by wage; this is a simultaneity problem. 
And measurement error in alcohol usage is always a possibility. For an illuminating 
discussion of the three kinds of endogeneity as they arise in a particular field, see 
Deaton’s (1995) survey chapter on econometric issues in development economics. 


4.2 Asymptotic Properties of Ordinary Least Squares 


We now briefly review the asymptotic properties of OLS for random samples from a
population, focusing on inference. It is convenient to write the population equation
of interest in vector form as


$$y = x\beta + u, \tag{4.5}$$


where x is a $1 \times K$ vector of regressors and $\beta \equiv (\beta_1, \beta_2, \ldots, \beta_K)'$ is a $K \times 1$ vector.
Because most equations contain an intercept, we will just assume that $x_1 \equiv 1$, as this
assumption makes interpreting the conditions easier.

We assume that we can obtain a random sample of size N from the population in
order to estimate $\beta$; thus, $\{(x_i, y_i)\colon i = 1, 2, \ldots, N\}$ are treated as independent,
identically distributed random variables, where $x_i$ is $1 \times K$ and $y_i$ is a scalar. For each
observation i we have


$$y_i = x_i\beta + u_i, \tag{4.6}$$


which is convenient for deriving statistical properties of estimators. As for stating and 
interpreting assumptions, it is easiest to focus on the population model (4.5). 

Defining the vector of explanatory variables $x_i$ as a row vector is less popular than
defining it as a column vector, but the row vector notation can be justified along
many dimensions, especially when we turn to models with multiple equations or time
periods. For now, we can justify the row vector notation by thinking about the most
convenient, and by far the most common, method of entering and storing data: in
tabular form, where the table has a row for each observation. In other words, if
$(x_i, y_i)$ are the outcomes for unit i, then it makes sense to view $(x_i, y_i)$ as the ith row
of our data matrix or table. Similarly, when we turn to estimation, it makes sense to
define $x_i$ to be the ith row of the matrix of explanatory variables.


4.2.1 Consistency 


As discussed in Section 4.1, the key assumption for OLS to consistently estimate $\beta$ is
the population orthogonality condition: 


ASSUMPTION OLS.1:  E(x’u) = 0. 


Because x contains a constant, Assumption OLS.1 is equivalent to saying that u 
has mean zero and is uncorrelated with each regressor, which is how we will refer to 
Assumption OLS.1. Sufficient for Assumption OLS.1 is the zero conditional mean 
assumption (4.3). 

It is critical to understand the population nature of Assumption OLS.1. The vector
(x,u) represents a population, and OLS.1 is a restriction on the joint distribution in
that population. For example, if x contains years of schooling and workforce
experience, and the main component of u is cognitive ability, then OLS.1 implies that
ability is uncorrelated with education and experience in the population; it has nothing
to do with relationships in a sample of data.

The other assumption needed for consistency of OLS is that the expected outer 
product matrix of x has full rank, so that there are no exact linear relationships 
among the regressors in the population. This is stated succinctly as follows: 


ASSUMPTION OLS.2: rank E(x’x) = K. 


As with Assumption OLS.1, Assumption OLS.2 is an assumption about the
population. Because E(x'x) is a symmetric $K \times K$ matrix, Assumption OLS.2 is
equivalent to assuming that E(x'x) is positive definite. Since $x_1 = 1$, Assumption OLS.2 is
also equivalent to saying that the (population) variance matrix of the $K - 1$
nonconstant elements in x is nonsingular. This is a standard assumption, which fails if
and only if at least one of the regressors can be written as a linear function of the
other regressors (in the population). Usually Assumption OLS.2 holds, but it can fail
if the population model is improperly specified (for example, if we include too many
dummy variables in x or mistakenly use something like log(age) and log(age$^2$) in the
same equation).
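
The failure mentioned at the end of the previous paragraph can be checked numerically. The following is a minimal sketch, assuming NumPy and simulated ages (the variable names and the uniform age distribution are illustrative, not from the text). Because rank(X'X) = rank(X), the sketch checks the rank of the data matrix X directly, which is numerically more stable than forming X'X.

```python
import numpy as np

rng = np.random.default_rng(0)
N = 500
age = rng.uniform(20, 60, size=N)  # hypothetical sample of ages

# Constant, log(age), log(age^2). The last column equals 2 * log(age),
# so there is an exact linear relationship among the regressors.
X_bad = np.column_stack([np.ones(N), np.log(age), np.log(age**2)])

# Using [log(age)]^2 instead removes the exact linear relationship.
X_ok = np.column_stack([np.ones(N), np.log(age), np.log(age) ** 2])

# rank(X'X) = rank(X), so the sample rank condition can be checked on X.
rank_bad = np.linalg.matrix_rank(X_bad)
rank_ok = np.linalg.matrix_rank(X_ok)

print("rank with log(age) and log(age^2):", rank_bad, "of", X_bad.shape[1])
print("rank with log(age) and [log(age)]^2:", rank_ok, "of", X_ok.shape[1])
```

The sample check is only suggestive of the population condition in Assumption OLS.2, but an exact functional dependence like this one fails in every sample as well as in the population.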

Under Assumptions OLS.1 and OLS.2, the parameter vector $\beta$ is identified. In the
context of models that are linear in the parameters under random sampling,
identification of $\beta$ simply means that $\beta$ can be written in terms of population moments
in observable variables. (Later, when we consider nonlinear models, the notion of
identification will have to be more general. Also, special issues arise if we cannot
obtain a random sample from the population, something we treat in Chapters 19 and
20.) To see that $\beta$ is identified under Assumptions OLS.1 and OLS.2, premultiply
equation (4.5) by x', take expectations, and solve to get


$$\beta = [E(x'x)]^{-1}E(x'y).$$


Because (x, y) is observed, $\beta$ is identified. The analogy principle for choosing an
estimator says to turn the population problem into its sample counterpart (see
Goldberger, 1968; Manski, 1988). In the current application this step leads to the method
of moments: replace the population moments E(x'x) and E(x'y) with the
corresponding sample averages. Doing so leads to the OLS estimator:


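$$\hat{\beta} = \left(N^{-1}\sum_{i=1}^{N} x_i'x_i\right)^{-1}\left(N^{-1}\sum_{i=1}^{N} x_i'y_i\right) = \beta + \left(N^{-1}\sum_{i=1}^{N} x_i'x_i\right)^{-1}\left(N^{-1}\sum_{i=1}^{N} x_i'u_i\right),$$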


which can be written in full matrix form as $(X'X)^{-1}X'Y$, where X is the $N \times K$ data
matrix of regressors with ith row $x_i$ and Y is the $N \times 1$ data vector with ith element
$y_i$. Under Assumption OLS.2, X'X is nonsingular with probability approaching one
and $\operatorname{plim}[(N^{-1}\sum_{i=1}^{N} x_i'x_i)^{-1}] = A^{-1}$, where $A \equiv E(x'x)$ (see Corollary 3.1). Further,
under Assumption OLS.1, $\operatorname{plim}(N^{-1}\sum_{i=1}^{N} x_i'u_i) = E(x'u) = 0$. Therefore, by Slutsky's
theorem (Lemma 3.4), $\operatorname{plim}\hat{\beta} = \beta + A^{-1}\cdot 0 = \beta$. We summarize with a theorem:


THEOREM 4.1 (Consistency of OLS): Under Assumptions OLS.1 and OLS.2, the
OLS estimator $\hat{\beta}$ obtained from a random sample following the population model
(4.5) is consistent for $\beta$.
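
Theorem 4.1 can be illustrated with a small simulation. This is a sketch under stated assumptions, not part of the original text: the population values in `beta`, the regressor distributions, and the form of heteroskedasticity are all made up for illustration. The error satisfies E(x'u) = 0 but has a conditional variance that depends on x, which OLS.1 and OLS.2 allow. The estimator is computed from sample moments, mirroring the method-of-moments formula above.

```python
import numpy as np

def ols(X, y):
    """OLS via sample moments: solve (X'X/N) b = (X'y/N)."""
    N = X.shape[0]
    return np.linalg.solve(X.T @ X / N, X.T @ y / N)

def simulate(N, rng):
    """Draw a random sample with E(x'u) = 0 and Var(u | x) depending on x."""
    beta = np.array([1.0, 0.5, -2.0])  # hypothetical population parameters
    x2 = rng.normal(size=N)
    x3 = rng.uniform(-1.0, 1.0, size=N)
    X = np.column_stack([np.ones(N), x2, x3])
    u = rng.normal(size=N) * np.sqrt(0.5 + x2**2)  # E(u | x) = 0, heteroskedastic
    return X, X @ beta + u, beta

rng = np.random.default_rng(42)
errors = {}
for N in (100, 10_000, 200_000):
    X, y, beta = simulate(N, rng)
    errors[N] = float(np.max(np.abs(ols(X, y) - beta)))
    print(f"N = {N:>7}: max |beta_hat - beta| = {errors[N]:.4f}")
```

In a typical run the largest estimation error shrinks as N grows, which is what consistency predicts; a single simulated sample is of course only suggestive.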


The simplicity of the proof of Theorem 4.1 should not undermine its usefulness.
Whenever an equation can be put into the form (4.5) and Assumptions OLS.1 and
OLS.2 hold, OLS using a random sample consistently estimates $\beta$. It does not matter
where this equation comes from, or what the $\beta_j$ actually represent. As we will see in
Sections 4.3 and 4.4, often an estimable equation is obtained only after manipulating
an underlying structural equation. An important point to remember is that, once
the linear (in parameters) equation has been specified with an additive error and
Assumptions OLS.1 and OLS.2 are verified, there is no need to reprove Theorem 4.1.

Under the assumptions of Theorem 4.1, $x\beta$ is the linear projection of y on x. Thus,
Theorem 4.1 shows that OLS consistently estimates the parameters in a linear pro- 
jection, subject to the rank condition in Assumption OLS.2. This is very general, as it 
places no restrictions on the nature of y—for example, y could be a binary variable 
or some other variable with discrete characteristics. Because a conditional expecta- 
tion that is linear in parameters is also the linear projection, Theorem 4.1 also shows 
that OLS consistently estimates conditional expectations that are linear in parame- 
ters. We will use this fact often in later sections. 

There are a few final points worth emphasizing. First, if either Assumption OLS.1
or OLS.2 fails, then $\beta$ is not identified (unless we make other assumptions, as in
Chapter 5). Usually it is correlation between u and one or more elements of x that
causes lack of identification. Second, the OLS estimator is not necessarily unbiased
even under Assumptions OLS.1 and OLS.2. However, if we impose the zero
conditional mean assumption (4.3), then it can be shown that $E(\hat{\beta} \mid X) = \beta$ if X'X is
nonsingular; see Problem 4.2. By iterated expectations, $\hat{\beta}$ is then also unconditionally
unbiased, provided the expected value $E(\hat{\beta})$ exists.

Finally, we have not made the much more restrictive assumption that u and x are 
independent. If E(u) = 0 and u is independent of x, then assumption (4.3) holds, but 
not vice versa. For example, Var(u | x) is entirely unrestricted under assumption (4.3), 
but Var(u | x) is necessarily constant if u and x are independent.


4.2.2 Asymptotic Inference Using Ordinary Least Squares


The asymptotic distribution of the OLS estimator is derived by writing 


$$\sqrt{N}(\hat{\beta} - \beta) = \left(N^{-1}\sum_{i=1}^{N} x_i'x_i\right)^{-1}\left(N^{-1/2}\sum_{i=1}^{N} x_i'u_i\right).$$


As we saw in Theorem 4.1, $(N^{-1}\sum_{i=1}^{N} x_i'x_i)^{-1} - A^{-1} = o_p(1)$. Also, $\{(x_i'u_i)\colon i = 1, 2, \ldots\}$
is an i.i.d. sequence with zero mean, and we assume that each element
has finite variance. Then the central limit theorem (Theorem 3.2) implies that
$N^{-1/2}\sum_{i=1}^{N} x_i'u_i \overset{d}{\to} \text{Normal}(0, B)$, where B is the $K \times K$ matrix


$$B \equiv E(u^2x'x). \tag{4.7}$$


This implies $N^{-1/2}\sum_{i=1}^{N} x_i'u_i = O_p(1)$, and so we can write


$$\sqrt{N}(\hat{\beta} - \beta) = A^{-1}\left(N^{-1/2}\sum_{i=1}^{N} x_i'u_i\right) + o_p(1) \tag{4.8}$$


because $o_p(1)\cdot O_p(1) = o_p(1)$. We can use equation (4.8) to immediately obtain the
asymptotic distribution of $\sqrt{N}(\hat{\beta} - \beta)$. A homoskedasticity assumption simplifies the
form of OLS asymptotic variance:


ASSUMPTION OLS.3: $E(u^2x'x) = \sigma^2E(x'x)$, where $\sigma^2 \equiv E(u^2)$.


Because E(u) = 0, $\sigma^2$ is also equal to Var(u). Assumption OLS.3 is the weakest form
of the homoskedasticity assumption. If we write out the $K \times K$ matrices in
Assumption OLS.3 element by element, we see that Assumption OLS.3 is equivalent to
assuming that the squared error, $u^2$, is uncorrelated with each $x_j$, $x_j^2$, and all cross
products of the form $x_jx_k$. By the law of iterated expectations, sufficient for
Assumption OLS.3 is $E(u^2 \mid x) = \sigma^2$, which is the same as $\mathrm{Var}(u \mid x) = \sigma^2$ when
E(u|x) = 0. The constant conditional variance assumption for u given x is the easiest
to interpret, but it is stronger than needed.


THEOREM 4.2 (Asymptotic Normality of OLS): Under Assumptions OLS.1-OLS.3,


$$\sqrt{N}(\hat{\beta} - \beta) \overset{a}{\sim} \text{Normal}(0, \sigma^2[E(x'x)]^{-1}). \tag{4.9}$$


Proof: From equation (4.8) and definition of B, it follows from Lemma 3.7 and 
Corollary 3.2 that 


$$\sqrt{N}(\hat{\beta} - \beta) \overset{a}{\sim} \text{Normal}(0, A^{-1}BA^{-1}),$$


where $A = E(x'x)$. Under Assumption OLS.3, $B = \sigma^2A$, which proves the result.

Practically speaking, equation (4.9) allows us to treat $\hat{\beta}$ as approximately normal
with mean $\beta$ and variance $\sigma^2[E(x'x)]^{-1}/N$. The usual estimator of $\sigma^2$,
$\hat{\sigma}^2 \equiv \text{SSR}/(N - K)$, where $\text{SSR} = \sum_{i=1}^{N} \hat{u}_i^2$ is the OLS sum of squared residuals, is easily shown
to be consistent. (Using N or $N - K$ in the denominator does not affect consistency.)
When we also replace E(x'x) with the sample average $N^{-1}\sum_{i=1}^{N} x_i'x_i = (X'X/N)$, we
get


$$\widehat{\mathrm{Avar}}(\hat{\beta}) = \hat{\sigma}^2(X'X)^{-1}. \tag{4.10}$$


The right-hand side of equation (4.10) should be familiar: it is the usual OLS variance
matrix estimator under the classical linear model assumptions. The bottom line of
Theorem 4.2 is that, under Assumptions OLS.1-OLS.3, the usual OLS standard
errors, t statistics, and F statistics are asymptotically valid. Showing that the F
statistic is approximately valid is done by deriving the Wald test for linear restrictions of
the form $R\beta = r$ (see Chapter 3). Then the F statistic is simply a
degrees-of-freedom-adjusted Wald statistic, which is where the F distribution (as opposed to the
chi-square distribution) arises.
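
As an illustration of equation (4.10), the usual variance matrix estimator and the resulting standard errors and t statistics can be computed directly from the data matrices. The sketch below assumes NumPy and a simulated homoskedastic sample with made-up parameter values; the function name `ols_usual` is hypothetical.

```python
import numpy as np

def ols_usual(X, y):
    """OLS coefficients with the usual variance estimate sigma2_hat * (X'X)^{-1}."""
    N, K = X.shape
    XtX_inv = np.linalg.inv(X.T @ X)
    beta_hat = XtX_inv @ (X.T @ y)
    resid = y - X @ beta_hat
    sigma2_hat = resid @ resid / (N - K)   # SSR / (N - K)
    avar_hat = sigma2_hat * XtX_inv        # equation (4.10)
    se = np.sqrt(np.diag(avar_hat))        # usual OLS standard errors
    return beta_hat, se, avar_hat

rng = np.random.default_rng(1)
N = 2_000
X = np.column_stack([np.ones(N), rng.normal(size=N)])
y = X @ np.array([1.0, 2.0]) + rng.normal(size=N)  # homoskedastic, sigma^2 = 1

beta_hat, se, avar_hat = ols_usual(X, y)
t_stats = beta_hat / se  # t statistics for H0: beta_j = 0
print("beta_hat:", beta_hat)
print("se:      ", se)
print("t stats: ", t_stats)
```

With a standard normal regressor and unit error variance, the slope standard error should be close to $1/\sqrt{N}$, which gives a quick sanity check on the output.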


4.2.3 Heteroskedasticity-Robust Inference 


If Assumption OLS.1 fails, we are in potentially serious trouble, as OLS is not even 
consistent. In the next chapter we discuss the important method of instrumental 
variables that can be used to obtain consistent estimators of $\beta$ when Assumption
OLS.1 fails. Assumption OLS.2 is also needed for consistency, but there is rarely any 
reason to examine its failure. 

Failure of Assumption OLS.3 has less serious consequences than failure of
Assumption OLS.1. As we have already seen, Assumption OLS.3 has nothing to do
with consistency of $\hat{\beta}$. Further, the proof of asymptotic normality based on equation
(4.8) is still valid without Assumption OLS.3, but the final asymptotic variance is 
different. We have assumed OLS.3 for deriving the limiting distribution because it 
implies the asymptotic validity of the usual OLS standard errors and test statistics. 
Typically, regression packages assume OLS.3 as the default in reporting statistics. 

Often there are reasons to believe that Assumption OLS.3 might fail, in which case 
equation (4.10) is no longer a valid estimate of even the asymptotic variance matrix. 
If we make the zero conditional mean assumption (4.3), one solution to violation 
of Assumption OLS.3 is to specify a model for Var(y|x), estimate this model, and 
apply weighted least squares (WLS): for observation i, $y_i$ and every element of $x_i$
(including unity) are divided by an estimate of the conditional standard deviation
$[\mathrm{Var}(y_i \mid x_i)]^{1/2}$, and OLS is applied to the weighted data (see Wooldridge (2009a,
Chap. 8) for details). This procedure leads to a different estimator of $\beta$. We discuss
WLS in the more general context of nonlinear regression in Chapter 12. Lately, it
has become more popular to estimate $\beta$ by OLS even when heteroskedasticity is
suspected but to adjust the standard errors and test statistics so that they are valid in the
presence of arbitrary heteroskedasticity. Because these standard errors are valid 
whether or not Assumption OLS.3 holds, this method is easier than a weighted least 
squares procedure. What we sacrifice is potential efficiency gains from WLS (see 
Chapter 14). But, efficiency gains from WLS are guaranteed only if the model for 
Var(y|x) is correct (although gains can often be realized with a misspecified variance 
model). As a more subtle point, WLS is generally inconsistent if $E(u \mid x) \neq 0$ but
Assumption OLS.1 holds, so WLS is inappropriate for estimating linear projections. 
Especially with large sample sizes, the presence of heteroskedasticity need not affect 
one’s ability to perform accurate inference using OLS. But we need to compute 
standard errors and test statistics appropriately. 

The adjustment needed to the asymptotic variance follows from the proof of
Theorem 4.2: without OLS.3, the asymptotic variance of $\hat{\beta}$ is $\mathrm{Avar}(\hat{\beta}) = A^{-1}BA^{-1}/N$,
where the $K \times K$ matrices A and B were defined earlier. We already know how
to consistently estimate A. Estimation of B is also straightforward. First, by the law
of large numbers, $N^{-1}\sum_{i=1}^{N} u_i^2x_i'x_i \overset{p}{\to} E(u^2x'x) = B$. Now, since the $u_i$ are not
observed, we replace $u_i$ with the OLS residual $\hat{u}_i = y_i - x_i\hat{\beta}$. This leads to the
consistent estimator $\hat{B} \equiv N^{-1}\sum_{i=1}^{N} \hat{u}_i^2x_i'x_i$. See White (2001) and Problem 4.4.

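The heteroskedasticity-robust variance matrix estimator of $\hat{\beta}$ is $\hat{A}^{-1}\hat{B}\hat{A}^{-1}/N$ or,
after cancellations involving the sample sizes,


$$\widehat{\mathrm{Avar}}(\hat{\beta}) = (X'X)^{-1}\left(\sum_{i=1}^{N} \hat{u}_i^2x_i'x_i\right)(X'X)^{-1}. \tag{4.11}$$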


This matrix was introduced in econometrics by White (1980b), although some attri- 
bute it to either Eicker (1967) or Huber (1967), statisticians who discovered robust 
variance matrices. The square roots of the diagonal elements of equation (4.11) are 
often called the White standard errors or Huber standard errors, or some hyphenated 
combination of the names Eicker, Huber, and White. It is probably best to just call 
them heteroskedasticity-robust standard errors, since this term describes their purpose. 
Remember, these standard errors are asymptotically valid in the presence of any kind 
of heteroskedasticity, including homoskedasticity. 

Robust standard errors are often reported in applied cross-sectional work, espe- 
cially when the sample size is large. Sometimes they are reported along with the 
usual OLS standard errors; sometimes they are presented in place of them. Several 
regression packages now report these standard errors as an option, so it is easy to 
obtain heteroskedasticity-robust standard errors. 


Sometimes, as a degrees-of-freedom correction, the matrix in equation (4.11) is 
multiplied by $N/(N - K)$. This procedure guarantees that, if the $\hat{u}_i^2$ were constant
across i (an unlikely event in practice, but the strongest evidence of homoskedasticity 
possible), then the usual OLS standard errors would be obtained. There is some evi- 
dence that the degrees-of-freedom adjustment improves finite sample performance. 
There are other ways to adjust equation (4.11) to improve its small-sample properties— 
see, for example, MacKinnon and White (1985)—but if N is large relative to K, these 
adjustments typically make little difference. 
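
The estimator in equation (4.11), together with the optional $N/(N - K)$ adjustment, can be sketched in a few lines. The code below assumes NumPy and simulated data in which Var(u | x) = x², so the usual standard errors are not valid for the slope; the function name and the data-generating process are illustrative assumptions, not taken from the text.

```python
import numpy as np

def ols_with_robust_se(X, y, dof_correction=False):
    """Return OLS coefficients, usual standard errors, and robust standard errors."""
    N, K = X.shape
    XtX_inv = np.linalg.inv(X.T @ X)
    beta_hat = XtX_inv @ (X.T @ y)
    resid = y - X @ beta_hat
    # Usual estimator, equation (4.10)
    V_usual = (resid @ resid / (N - K)) * XtX_inv
    # Middle matrix: sum over i of resid_i^2 * x_i' x_i
    meat = (X * (resid**2)[:, None]).T @ X
    V_robust = XtX_inv @ meat @ XtX_inv    # equation (4.11)
    if dof_correction:
        V_robust = V_robust * N / (N - K)  # degrees-of-freedom adjustment
    return beta_hat, np.sqrt(np.diag(V_usual)), np.sqrt(np.diag(V_robust))

rng = np.random.default_rng(7)
N = 20_000
x = rng.normal(size=N)
X = np.column_stack([np.ones(N), x])
u = x * rng.normal(size=N)  # Var(u | x) = x^2: heteroskedastic
y = X @ np.array([1.0, 2.0]) + u

beta_hat, se_usual, se_robust = ols_with_robust_se(X, y)
_, _, se_robust_dof = ols_with_robust_se(X, y, dof_correction=True)
print("usual se: ", se_usual)
print("robust se:", se_robust)
```

In this design the robust slope standard error is noticeably larger than the usual one (the population ratio is $\sqrt{3}$ for a standard normal regressor), while the two intercept standard errors are close.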

Once standard errors are obtained, t statistics are computed in the usual way.
These are robust to heteroskedasticity of unknown form, and can be used to test
single restrictions. The t statistics computed from heteroskedasticity-robust standard
errors are heteroskedasticity-robust t statistics. Confidence intervals are also obtained
in the usual way.

When Assumption OLS.3 fails, the usual F statistic is not valid for testing multiple 
linear restrictions, even asymptotically. Many packages allow robust testing with a 
simple command. If the hypotheses are written as 


$$H_0\colon R\beta = r, \tag{4.12}$$


where R is Q x K and has rank Q < K, and r is Q x 1, then the heteroskedasticity- 
robust Wald statistic for testing equation (4.12) is 


$$W = (R\hat{\beta} - r)'(R\hat{V}R')^{-1}(R\hat{\beta} - r), \tag{4.13}$$


where $\hat{V}$ is given in equation (4.11). Under $H_0$, $W \overset{a}{\sim} \chi^2_Q$. The Wald statistic can be
turned into an approximate $F_{Q,N-K}$ random variable by dividing it by Q (and
usually making the degrees-of-freedom adjustment to $\hat{V}$). But there is nothing wrong
with using equation (4.13) directly.
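
A minimal sketch of the statistic in equation (4.13) follows, assuming NumPy and a simulated heteroskedastic sample. The function name `robust_wald`, the parameter values, and the two illustrative hypotheses are assumptions made for the example; p-values are omitted because they would require a chi-square distribution function from another library.

```python
import numpy as np

def robust_wald(X, y, R, r):
    """W = (R b - r)' (R V R')^{-1} (R b - r), with V from equation (4.11)."""
    XtX_inv = np.linalg.inv(X.T @ X)
    beta_hat = XtX_inv @ (X.T @ y)
    resid = y - X @ beta_hat
    V = XtX_inv @ ((X * (resid**2)[:, None]).T @ X) @ XtX_inv
    diff = R @ beta_hat - r
    return float(diff @ np.linalg.solve(R @ V @ R.T, diff))

rng = np.random.default_rng(3)
N = 2_000
Z = rng.normal(size=(N, 3))
X = np.column_stack([np.ones(N), Z])
beta = np.array([1.0, 0.5, 0.0, 0.0])  # hypothetical: last two slopes are zero
u = rng.normal(size=N) * np.sqrt(0.5 + Z[:, 0] ** 2)  # heteroskedastic error
y = X @ beta + u

# H0: the last two coefficients are zero (true in this design), Q = 2
R_true = np.array([[0.0, 0.0, 1.0, 0.0],
                   [0.0, 0.0, 0.0, 1.0]])
W_true = robust_wald(X, y, R_true, np.zeros(2))

# H0: the first slope is zero (false in this design), Q = 1
R_false = np.array([[0.0, 1.0, 0.0, 0.0]])
W_false = robust_wald(X, y, R_false, np.zeros(1))

F_approx = W_true / R_true.shape[0]  # W / Q, the approximate F version
print("W (true null): ", W_true)
print("W (false null):", W_false)
```

The first statistic should look like a draw from a chi-square distribution with two degrees of freedom, while the second should be far out in the tail.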


4.2.4 Lagrange Multiplier (Score) Tests


In the partitioned model


$$y = x_1\beta_1 + x_2\beta_2 + u, \tag{4.14}$$


under Assumptions OLS.1-OLS.3, where $x_1$ is $1 \times K_1$ and $x_2$ is $1 \times K_2$, we know that
the hypothesis $H_0\colon \beta_2 = 0$ is easily tested (asymptotically) using a standard F test.
There is another approach to testing such hypotheses that is sometimes useful,
especially for computing heteroskedasticity-robust tests and for nonlinear models.

Let $\tilde{\beta}_1$ be the estimator of $\beta_1$ under the null hypothesis $H_0\colon \beta_2 = 0$; this is called
the estimator from the restricted model. Define the restricted OLS residuals as
$\tilde{u}_i = y_i - x_{i1}\tilde{\beta}_1$, $i = 1, 2, \ldots, N$. Under $H_0$, $x_{i2}$ should be, up to sample variation,
uncorrelated with $\tilde{u}_i$ in the sample. The Lagrange multiplier or score principle is based on
this observation. It turns out that a valid test statistic is obtained as follows: Run the
OLS regression


$$\tilde{u} \text{ on } x_1, x_2 \tag{4.15}$$


(where the observation index i has been suppressed). Assuming that $x_1$ contains a
constant (that is, the null model contains a constant), let $R_u^2$ denote the usual
R-squared from the regression (4.15). Then the Lagrange multiplier (LM) or score
statistic is $LM \equiv NR_u^2$. These names come from different features of the constrained
optimization problem; see Rao (1948), Aitchison and Silvey (1958), and Chapter
12. Because of its form, LM is also referred to as an N-R-squared test. Under $H_0$,
$LM \overset{a}{\sim} \chi^2_{K_2}$, where $K_2$ is the number of restrictions being tested. If $NR_u^2$ is
sufficiently large, then $\tilde{u}$ is significantly correlated with $x_2$, and the null hypothesis will be
rejected.

It is important to include $x_1$ along with $x_2$ in regression (4.15). In other words, the
OLS residuals from the null model should be regressed on all explanatory variables,
even though $\tilde{u}$ is orthogonal to $x_1$ in the sample. If $x_1$ is excluded, then the resulting
statistic generally does not have a chi-square distribution when $x_2$ and $x_1$ are
correlated. If $E(x_1'x_2) = 0$, then we can exclude $x_1$ from regression (4.15), but this
orthogonality rarely holds in applications. If $x_1$ does not include a constant, $R_u^2$ should be
the uncentered R-squared: the total sum of squares in the denominator is obtained
without demeaning the dependent variable, $\tilde{u}$. When $x_1$ includes a constant, the usual
centered R-squared and uncentered R-squared are identical because $\sum_{i=1}^{N} \tilde{u}_i = 0$.
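
The N-R-squared procedure can be sketched as follows, assuming NumPy, a restricted model whose regressor matrix `X1` contains a constant column, and simulated homoskedastic data. The function name `lm_nr2` and the parameter values are hypothetical and chosen only to illustrate the steps.

```python
import numpy as np

def lm_nr2(y, X1, X2):
    """LM = N * R^2 from regressing restricted residuals on (x1, x2).

    Assumes X1 contains a constant column, so the centered R-squared applies.
    """
    N = y.shape[0]
    b1 = np.linalg.lstsq(X1, y, rcond=None)[0]  # restricted estimator
    u_tilde = y - X1 @ b1                       # restricted residuals
    X = np.column_stack([X1, X2])               # include x1 along with x2
    g = np.linalg.lstsq(X, u_tilde, rcond=None)[0]
    ssr = np.sum((u_tilde - X @ g) ** 2)
    sst = np.sum((u_tilde - u_tilde.mean()) ** 2)
    return N * (1.0 - ssr / sst)

rng = np.random.default_rng(11)
N = 1_000
X1 = np.column_stack([np.ones(N), rng.normal(size=N)])
X2 = rng.normal(size=(N, 2))  # K2 = 2 restrictions
v = rng.normal(size=N)

y_null = X1 @ np.array([1.0, 0.5]) + v          # x2 is irrelevant
y_alt = y_null + X2 @ np.array([0.5, 0.0])      # x2 matters

lm_null = lm_nr2(y_null, X1, X2)
lm_alt = lm_nr2(y_alt, X1, X2)
print("LM when H0 is true: ", lm_null)
print("LM when H0 is false:", lm_alt)
```

Under the null the statistic should behave like a chi-square draw with two degrees of freedom; under the alternative it should be very large.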


Example 4.1 (Wage Equation for Married, Working Women): Consider a wage 
equation for married, working women: 


$$\log(wage) = \beta_0 + \beta_1 exper + \beta_2 exper^2 + \beta_3 educ + \beta_4 age + \beta_5 kidslt6 + \beta_6 kidsge6 + u, \tag{4.16}$$


where the last three variables are the woman's age, number of children less than six,
and number of children at least six years of age, respectively. We can test whether,
after the productivity variables experience and education are controlled for, women
are paid differently depending on their age and number of children. The F statistic for
the hypothesis $H_0\colon \beta_4 = 0, \beta_5 = 0, \beta_6 = 0$ is $F = [(R_{ur}^2 - R_r^2)/(1 - R_{ur}^2)]\cdot[(N - 7)/3]$,
where $R_{ur}^2$ and $R_r^2$ are the unrestricted and restricted R-squareds; under $H_0$ (and
homoskedasticity), $F \overset{a}{\sim} F_{3,N-7}$. To obtain the LM statistic, we estimate the equation
without age, kidslt6, and kidsge6; let $\tilde{u}$ denote the OLS residuals. Then, the LM
statistic is $NR_u^2$ from the regression $\tilde{u}$ on 1, exper, exper$^2$, educ, age, kidslt6, and kidsge6,
where the 1 denotes that we include an intercept. Under $H_0$ and homoskedasticity,
$NR_u^2 \overset{a}{\sim} \chi^2_3$.

Using the data on the 428 working, married women in MROZ.RAW (from Mroz, 
1987), we obtain the following estimated equation: 


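$$
\begin{aligned}
\widehat{\log(wage)} = {} & \underset{\substack{(.317)\\ [.318]}}{-.421} + \underset{\substack{(.013)\\ [.015]}}{.040}\,exper - \underset{\substack{(.00040)\\ [.00041]}}{.00078}\,exper^2 + \underset{\substack{(.014)\\ [.014]}}{.108}\,educ \\
& - \underset{\substack{(.0053)\\ [.0059]}}{.0015}\,age - \underset{\substack{(.089)\\ [.106]}}{.061}\,kidslt6 - \underset{\substack{(.028)\\ [.029]}}{.015}\,kidsge6, \qquad R^2 = .158
\end{aligned}
$$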


where the quantities in brackets are the heteroskedasticity-robust standard errors.
The F statistic for joint significance of age, kidslt6, and kidsge6 turns out to be about
.24, which gives p-value $\approx$ .87. Regressing the residuals $\tilde{u}$ from the restricted model
on all exogenous variables gives an R-squared of .0017, so LM = 428(.0017) = .728,
and p-value $\approx$ .87. Thus, the F and LM tests give virtually identical results.


The test from regression (4.15) maintains Assumption OLS.3 under Ho, just like 
the usual F test. It turns out to be easy to obtain a heteroskedasticity-robust LM 
statistic. To see how to do so, let us look at the formula for the LM statistic from 
regression (4.15) in more detail. After some algebra we can write 


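$$LM = \left(N^{-1/2}\sum_{i=1}^{N} \hat{r}_i'\tilde{u}_i\right)'\left(\tilde{\sigma}^2N^{-1}\sum_{i=1}^{N} \hat{r}_i'\hat{r}_i\right)^{-1}\left(N^{-1/2}\sum_{i=1}^{N} \hat{r}_i'\tilde{u}_i\right),$$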


where $\tilde{\sigma}^2 \equiv N^{-1}\sum_{i=1}^{N} \tilde{u}_i^2$ and each $\hat{r}_i$ is a $1 \times K_2$ vector of OLS residuals from the
(multivariate) regression of $x_{i2}$ on $x_{i1}$, $i = 1, 2, \ldots, N$. This statistic is not robust to
heteroskedasticity because the matrix in the middle is not a consistent estimator of
the asymptotic variance of $(N^{-1/2}\sum_{i=1}^{N} \hat{r}_i'\tilde{u}_i)$ under heteroskedasticity. Following the
reasoning in Section 4.2.3, a heteroskedasticity-robust statistic is


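$$
\begin{aligned}
LM &= \left(N^{-1/2}\sum_{i=1}^{N} \hat{r}_i'\tilde{u}_i\right)'\left(N^{-1}\sum_{i=1}^{N} \tilde{u}_i^2\hat{r}_i'\hat{r}_i\right)^{-1}\left(N^{-1/2}\sum_{i=1}^{N} \hat{r}_i'\tilde{u}_i\right) \\
&= \left(\sum_{i=1}^{N} \hat{r}_i'\tilde{u}_i\right)'\left(\sum_{i=1}^{N} \tilde{u}_i^2\hat{r}_i'\hat{r}_i\right)^{-1}\left(\sum_{i=1}^{N} \hat{r}_i'\tilde{u}_i\right).
\end{aligned}
$$


Dropping the i subscript, this is easily obtained as $N - \text{SSR}_0$ from the OLS
regression (without an intercept)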


$$1 \text{ on } \tilde{u}\cdot\hat{r}, \tag{4.17}$$


where $\tilde{u}\cdot\hat{r} \equiv (\tilde{u}\cdot\hat{r}_1, \tilde{u}\cdot\hat{r}_2, \ldots, \tilde{u}\cdot\hat{r}_{K_2})$ is the $1 \times K_2$ vector obtained by multiplying $\tilde{u}$
by each element of $\hat{r}$ and $\text{SSR}_0$ is just the usual sum of squared residuals from
regression (4.17). Thus, we first regress each element of $x_2$ onto all of $x_1$ and collect the
residuals in $\hat{r}$. Then we form $\tilde{u}\cdot\hat{r}$ (observation by observation) and run the regression
in (4.17); $N - \text{SSR}_0$ from this regression is distributed asymptotically as $\chi^2_{K_2}$. (Do not
be thrown off by the fact that the dependent variable in regression (4.17) is unity for
each observation; a nonzero sum of squared residuals is reported when you run OLS
without an intercept.) For more details, see Davidson and MacKinnon (1985, 1993)
or Wooldridge (1991a, 1995b).
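
The steps just described translate directly into code. The sketch below assumes NumPy and simulated heteroskedastic data in which the null hypothesis is true; the function name `robust_lm` and all parameter values are illustrative. As a check, it also evaluates the quadratic form in the display preceding regression (4.17), which should agree with $N - \text{SSR}_0$ up to rounding.

```python
import numpy as np

def robust_lm(y, X1, X2):
    """Heteroskedasticity-robust LM statistic, N - SSR_0 from regression (4.17)."""
    N = y.shape[0]
    # Step 1: restricted residuals from regressing y on x1
    u_tilde = y - X1 @ np.linalg.lstsq(X1, y, rcond=None)[0]
    # Step 2: residuals from regressing each column of x2 on x1
    r_hat = X2 - X1 @ np.linalg.lstsq(X1, X2, rcond=None)[0]
    # Step 3: products u_tilde * r_hat, observation by observation
    Z = r_hat * u_tilde[:, None]
    # Step 4: regress 1 on the products, without an intercept
    ones = np.ones(N)
    coef = np.linalg.lstsq(Z, ones, rcond=None)[0]
    ssr0 = np.sum((ones - Z @ coef) ** 2)
    return N - ssr0, Z

rng = np.random.default_rng(21)
N = 1_000
x = rng.normal(size=N)
X1 = np.column_stack([np.ones(N), x])
X2 = rng.normal(size=(N, 2))
u = rng.normal(size=N) * np.sqrt(0.5 + x**2)  # heteroskedastic error
y = X1 @ np.array([1.0, 0.5]) + u             # H0: beta_2 = 0 is true

lm, Z = robust_lm(y, X1, X2)

# Direct evaluation of the quadratic form as a cross-check
s = Z.sum(axis=0)
lm_direct = float(s @ np.linalg.solve(Z.T @ Z, s))
print("N - SSR_0:     ", lm)
print("quadratic form:", lm_direct)
```

The agreement between the two numbers is an algebraic identity, since the explained sum of squares from regressing a vector of ones on the products equals the quadratic form.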


Example 4.1 (continued): To obtain the heteroskedasticity-robust LM statistic for
$H_0\colon \beta_4 = 0, \beta_5 = 0, \beta_6 = 0$ in equation (4.16), we estimate the restricted model as
before and obtain $\tilde{u}$. Then we run the regressions (1) age on 1, exper, exper$^2$, educ; (2)
kidslt6 on 1, exper, exper$^2$, educ; (3) kidsge6 on 1, exper, exper$^2$, educ; and obtain the
residuals $\hat{r}_1$, $\hat{r}_2$, and $\hat{r}_3$, respectively. The LM statistic is $N - \text{SSR}_0$ from the regression
1 on $\tilde{u}\cdot\hat{r}_1$, $\tilde{u}\cdot\hat{r}_2$, $\tilde{u}\cdot\hat{r}_3$, and $N - \text{SSR}_0 \overset{a}{\sim} \chi^2_3$.

When we apply this result to the data in MROZ.RAW we get LM = .51, which
is very small for a $\chi^2_3$ random variable: p-value $\approx$ .92. For comparison, the
heteroskedasticity-robust Wald statistic (scaled by Stata to have an approximate F
distribution) also yields p-value $\approx$ .92.


4.3 Ordinary Least Squares Solutions to the Omitted Variables Problem 


4.3.1 Ordinary Least Squares Ignoring the Omitted Variables 


Because it is so prevalent in applied work, we now consider the omitted variables 
problem in more detail. A model that assumes an additive effect of the omitted vari- 
able is 


$$E(y \mid x_1, x_2, \ldots, x_K, q) = \beta_0 + \beta_1x_1 + \beta_2x_2 + \cdots + \beta_Kx_K + \gamma q, \tag{4.18}$$


where q is the omitted factor. In particular, we are interested in the $\beta_j$, which are the
partial effects of the observed explanatory variables, holding the other explanatory 
variables constant, including the unobservable q. In the context of this additive 
model, there is no point in allowing for more than one unobservable; any omitted 
factors are lumped into q. Henceforth we simply refer to q as the omitted variable.

A good example of equation (4.18) is seen when y is log(wage) and q includes
ability. If $x_K$ denotes a measure of education, $\beta_K$ in equation (4.18) measures the
partial effect of education on wages controlling for (or holding fixed) the level of
ability (as well as other observed characteristics). This effect is most interesting from
a policy perspective because it provides a causal interpretation of the return to
education: $\beta_K$ is the expected proportionate increase in wage if someone from the
working population is exogenously given another year of education.

Viewing equation (4.18) as a structural model, we can always write it in error form
as


$$y = \beta_0 + \beta_1x_1 + \beta_2x_2 + \cdots + \beta_Kx_K + \gamma q + v, \tag{4.19}$$

$$E(v \mid x_1, x_2, \ldots, x_K, q) = 0, \tag{4.20}$$


where v is the structural error. One way to handle the nonobservability of q is to put 
it into the error term. In doing so, nothing is lost by assuming E(q) = 0, because an 
intercept is included in equation (4.19). Putting q into the error term means we
rewrite equation (4.19) as


$$y = \beta_0 + \beta_1x_1 + \beta_2x_2 + \cdots + \beta_Kx_K + u, \tag{4.21}$$

$$u \equiv \gamma q + v. \tag{4.22}$$


The error u in equation (4.21) consists of two parts. Under equation (4.20), v has zero
mean and is uncorrelated with $x_1, x_2, \ldots, x_K$ (and q). By normalization, q also has
zero mean. Thus, E(u) = 0. However, u is uncorrelated with $x_j$ if and only if q is
uncorrelated with $x_j$. If q is correlated with any of the regressors, then so is u, and we
have an endogeneity problem. We cannot expect OLS to consistently estimate any $\beta_j$.
Although $E(u \mid x) \neq E(u)$ in equation (4.21), the $\beta_j$ do have a structural interpretation
because they appear in equation (4.19).

It is easy to characterize the plims of the OLS estimators when the omitted variable 
is ignored; we will call this the OLS omitted variables inconsistency or OLS omitted 
variables bias (even though the latter term is not always precise). Write the linear 
projection of q onto the observable explanatory variables as 


$$q = \delta_0 + \delta_1x_1 + \cdots + \delta_Kx_K + r, \tag{4.23}$$


where, by definition of a linear projection, E(r) = 0, $\mathrm{Cov}(x_j, r) = 0$, $j = 1, 2, \ldots, K$.
The parameter $\delta_j$ measures the relationship between q and $x_j$ after "partialing out"
the other $x_k$. Then we can easily infer the plim of the OLS estimators from regressing
y onto $1, x_1, \ldots, x_K$ by finding an equation that does satisfy Assumptions OLS.1 and
OLS.2. Plugging equation (4.23) into equation (4.19) and doing simple algebra gives


$$y = (\beta_0 + \gamma\delta_0) + (\beta_1 + \gamma\delta_1)x_1 + (\beta_2 + \gamma\delta_2)x_2 + \cdots + (\beta_K + \gamma\delta_K)x_K + v + \gamma r.$$


Now, the error $v + \gamma r$ has zero mean and is uncorrelated with each regressor. It
follows that we can just read off the plim of the OLS estimators from the regression of y
on $1, x_1, \ldots, x_K$: $\operatorname{plim}\hat{\beta}_j = \beta_j + \gamma\delta_j$. Sometimes it is assumed that most of the $\delta_j$ are
zero. When the correlation between q and a particular variable, say $x_K$, is the focus,
a common (usually implicit) assumption is that all $\delta_j$ in equation (4.23) except the
intercept and coefficient on $x_K$ are zero. Then $\operatorname{plim}\hat{\beta}_j = \beta_j$, $j = 1, \ldots, K - 1$, and


$$\operatorname{plim}\hat{\beta}_K = \beta_K + \gamma[\mathrm{Cov}(x_K, q)/\mathrm{Var}(x_K)] \tag{4.24}$$


(because $\delta_K = \mathrm{Cov}(x_K, q)/\mathrm{Var}(x_K)$ in this case). This formula gives us a simple way
to determine the sign, and perhaps the magnitude, of the inconsistency in $\hat{\beta}_K$. If $\gamma > 0$
and $x_K$ and q are positively correlated, the asymptotic bias is positive. The other
combinations are easily worked out. If $x_K$ has substantial variation in the population
relative to the covariance between $x_K$ and q, then the bias can be small. In the general
case of equation (4.23), it is difficult to sign $\delta_K$ because it measures a partial
correlation. It is for this reason that $\delta_j = 0$, $j = 1, \ldots, K - 1$ is often maintained for
determining the likely asymptotic bias in $\hat{\beta}_K$ when only $x_K$ is endogenous.
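
Equation (4.24) can be checked with a simulation in which only $x_K$ is correlated with the omitted variable. Everything in the sketch below is an illustrative assumption: the parameter values, the normal distributions, and the variable names. It assumes NumPy. With $\gamma = 2$ and $\mathrm{Cov}(x_K, q)/\mathrm{Var}(x_K) = 0.6$, the formula gives $\operatorname{plim}\hat{\beta}_K = 1 + 2(0.6) = 2.2$, while the coefficient on the other regressor should remain consistent.

```python
import numpy as np

rng = np.random.default_rng(5)
N = 200_000
beta = np.array([1.0, 0.5, 1.0])  # hypothetical: intercept, beta_1, beta_K
gamma, delta_K = 2.0, 0.6         # effect of q, and slope of q on x_K

x1 = rng.normal(size=N)
xK = rng.normal(size=N)
q = delta_K * xK + rng.normal(size=N)  # q is correlated with x_K only
v = rng.normal(size=N)

X = np.column_stack([np.ones(N), x1, xK])
y = X @ beta + gamma * q + v           # structural equation with q included

# OLS regression that ignores the omitted variable q
b_short = np.linalg.lstsq(X, y, rcond=None)[0]

# Sample analogue of the right-hand side of equation (4.24)
plim_bK = beta[2] + gamma * np.cov(xK, q)[0, 1] / np.var(xK, ddof=1)

print("coefficient on x1 (true 0.5):", b_short[1])
print("coefficient on xK (true 1.0):", b_short[2])
print("value implied by (4.24):     ", plim_bK)
```

The estimated coefficient on $x_K$ should sit near the value implied by (4.24) rather than near the structural value of 1.0, which is the omitted variables inconsistency described above.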


Example 4.2 (Wage Equation with Unobserved Ability): Write a structural wage 
equation explicitly as 


$$\log(wage) = \beta_0 + \beta_1 exper + \beta_2 exper^2 + \beta_3 educ + \gamma\, abil + v,$$


where v has the structural error property E(v | exper, educ, abil) = 0. If abil is
uncorrelated with exper and exper$^2$ once educ has been partialed out (that is,
$abil = \delta_0 + \delta_3 educ + r$ with r uncorrelated with exper and exper$^2$), then $\operatorname{plim}\hat{\beta}_3 = \beta_3 + \gamma\delta_3$.
Under these assumptions, the coefficients on exper and exper$^2$ are consistently
estimated by the OLS regression that omits ability. If $\delta_3 > 0$ then $\operatorname{plim}\hat{\beta}_3 > \beta_3$ (because
$\gamma > 0$ by definition), and the return to education is likely to be overestimated in large
samples.


4.3.2 Proxy Variable—Ordinary Least Squares Solution 


Omitted variables bias can be eliminated, or at least mitigated, if a proxy variable is
available for the unobserved variable q. There are two formal requirements for a
proxy variable for q. The first is that the proxy variable should be redundant
(sometimes called ignorable) in the structural equation. If z is a proxy variable for q, then
the most natural statement of redundancy of z in equation (4.18) is


$$E(y \mid x, q, z) = E(y \mid x, q). \tag{4.25}$$


Condition (4.25) is easy to interpret: z is irrelevant for explaining y, in a conditional 
mean sense, once x and q have been controlled for. This assumption on a proxy 
variable is virtually always made (sometimes only implicitly), and it is rarely contro- 
versial: the only reason we bother with z in the first place is that we cannot get data 
on q. Anyway, we cannot get very far without condition (4.25). In the wage—education 
example, let q be ability and z be IQ score. By definition it is ability that affects wage: 
IQ would not matter if true ability were known. 

Condition (4.25) is somewhat stronger than needed when unobservables appear 
additively as in equation (4.18); it suffices to assume that v in equation (4.19) is 
simply uncorrelated with z. But we will focus on condition (4.25) because it is natu- 
ral, and because we need it to cover models where q interacts with some observed 
covariates. 

The second requirement of a good proxy variable is more complicated. We require
that the correlation between the omitted variable $q$ and each $x_j$ be zero once we partial
out $z$. This is easily stated in terms of a linear projection:


$$L(q \mid 1, x_1, \ldots, x_K, z) = L(q \mid 1, z). \tag{4.26}$$


It is also helpful to see this relationship in terms of an equation with an unobserved 
error. Write q as a linear function of z and an error term as 


$$q = \theta_0 + \theta_1 z + r, \tag{4.27}$$


where, by definition, $E(r) = 0$ and $\mathrm{Cov}(z, r) = 0$ because $\theta_0 + \theta_1 z$ is the linear
projection of $q$ on $1, z$. If $z$ is a reasonable proxy for $q$, $\theta_1 \neq 0$ (and we usually think in
terms of $\theta_1 > 0$). But condition (4.26) assumes much more: it is equivalent to


$$\mathrm{Cov}(x_j, r) = 0, \qquad j = 1, 2, \ldots, K.$$


This condition requires $z$ to be closely enough related to $q$ so that once it is included
in equation (4.27), the $x_j$ are not partially correlated with $q$.

Before showing why these two proxy variable requirements do the trick, we should 
head off some possible confusion. The definition of proxy variable here is not uni- 
versal. While a proxy variable is always assumed to satisfy the redundancy condition 
(4.25), it is not always assumed to have the second property. In Chapter 5 we will use 
the notion of an indicator of q, which satisfies condition (4.25) but not the second 
proxy variable assumption. 

To obtain an estimable equation, replace q in equation (4.19) with equation (4.27)
to get 


$$y = (\beta_0 + \gamma\theta_0) + \beta_1 x_1 + \cdots + \beta_K x_K + \gamma\theta_1 z + (\gamma r + v). \tag{4.28}$$


Under the assumptions made, the composite error term $u \equiv \gamma r + v$ is uncorrelated
with $x_j$ for all $j$; redundancy of $z$ in equation (4.18) means that $z$ is uncorrelated with
$v$ and, by definition, $z$ is uncorrelated with $r$. It follows immediately from Theorem
4.1 that the OLS regression $y$ on $1, x_1, x_2, \ldots, x_K, z$ produces consistent estimators of
$(\beta_0 + \gamma\theta_0), \beta_1, \beta_2, \ldots, \beta_K$, and $\gamma\theta_1$. Thus, we can estimate the partial effect of each of
the $x_j$ in equation (4.18) under the proxy variable assumptions.
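
A small simulation can make the consistency claim concrete. This is only a sketch under stated assumptions: a single regressor, normal errors, and hypothetical parameter values, with the data generated so that both proxy variable requirements hold. The helper `ols` is defined here for illustration and is not part of any library.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 200_000

# Hypothetical population values (illustrative assumptions).
beta0, beta1, gamma = 1.0, 2.0, 1.5
theta0, theta1 = 0.5, 0.8

z = rng.normal(size=n)                    # proxy variable
r = rng.normal(size=n)                    # projection error, independent of x and z
q = theta0 + theta1 * z + r               # unobserved variable, as in (4.27)
x = 0.5 * z + rng.normal(size=n)          # x is correlated with q only through z
v = rng.normal(size=n)                    # structural error
y = beta0 + beta1 * x + gamma * q + v

def ols(y, *regressors):
    """OLS coefficients from regressing y on a constant and the regressors."""
    X = np.column_stack([np.ones(len(y)), *regressors])
    return np.linalg.lstsq(X, y, rcond=None)[0]

b_omit = ols(y, x)        # q omitted, no proxy
b_proxy = ols(y, x, z)    # proxy variable regression, as in (4.28)

print("slope on x, no proxy:  ", b_omit[1])
print("slope on x, with proxy:", b_proxy[1], "(beta1 =", beta1, ")")
print("intercept, with proxy: ", b_proxy[0], "(beta0 + gamma*theta0 =", beta0 + gamma * theta0, ")")
print("coef on z, with proxy: ", b_proxy[2], "(gamma*theta1 =", gamma * theta1, ")")
```

In this design the regression without the proxy converges to $\beta_1 + \gamma\,\mathrm{Cov}(x, q)/\mathrm{Var}(x) = 2.48$, while the proxy regression recovers $\beta_1$, $\beta_0 + \gamma\theta_0$, and $\gamma\theta_1$.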

When $z$ is an imperfect proxy, then $r$ in equation (4.27) is correlated with one or
more of the $x_j$. Generally, when we do not impose condition (4.26) and write the
linear projection as


$$q = \theta_0 + \rho_1 x_1 + \cdots + \rho_K x_K + \theta_1 z + r,$$


the proxy variable regression gives $\mathrm{plim}\, \hat{\beta}_j = \beta_j + \gamma\rho_j$. Thus, OLS with an imperfect
proxy is inconsistent. The hope is that the $\rho_j$ are smaller in magnitude than if $z$ were
omitted from the linear projection, and this can usually be argued if $z$ is a reasonable
proxy for $q$; but see the end of this subsection for further discussion.

If including $z$ induces substantial collinearity, it might be better to use OLS without
the proxy variable. However, in making these decisions we must recognize that
including $z$ reduces the error variance if $\theta_1 \neq 0$: $\mathrm{Var}(\gamma r + v) < \mathrm{Var}(\gamma q + v)$ because
$\mathrm{Var}(r) < \mathrm{Var}(q)$, and $v$ is uncorrelated with both $r$ and $q$. Including a proxy variable
can actually reduce asymptotic variances as well as mitigate bias.


Example 4.3 (Using IQ as a Proxy for Ability): We apply the proxy variable 
method to the data on working men in NLS80.RAW, which was used by Blackburn 
and Neumark (1992), to estimate the structural model 


$$\log(wage) = \beta_0 + \beta_1 exper + \beta_2 tenure + \beta_3 married + \beta_4 south + \beta_5 urban + \beta_6 black + \beta_7 educ + \gamma\, abil + v, \tag{4.29}$$


where exper is labor market experience, married is a dummy variable equal to unity if 
married, south is a dummy variable for the southern region, urban is a dummy vari- 
able for living in an SMSA, black is a race indicator, and educ is years of schooling. 
We assume that $IQ$ satisfies the proxy variable assumptions: in the linear projection
$abil = \theta_0 + \theta_1 IQ + r$, where $r$ has zero mean and is uncorrelated with $IQ$, we also
assume that $r$ is uncorrelated with experience, tenure, education, and other factors
appearing in equation (4.29). The estimated equations without and with $IQ$ are


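$$\widehat{\log(wage)} = \underset{(0.11)}{5.40} + \underset{(.003)}{.014}\, exper + \underset{(.002)}{.012}\, tenure + \underset{(.039)}{.199}\, married - \underset{(.026)}{.091}\, south + \underset{(.027)}{.184}\, urban - \underset{(.038)}{.188}\, black + \underset{(.006)}{.065}\, educ$$

$$N = 935, \quad R^2 = .253$$


$$\widehat{\log(wage)} = \underset{(0.13)}{5.18} + \underset{(.003)}{.014}\, exper + \underset{(.002)}{.011}\, tenure + \underset{(.039)}{.200}\, married - \underset{(.026)}{.080}\, south + \underset{(.027)}{.182}\, urban - \underset{(.039)}{.143}\, black + \underset{(.007)}{.054}\, educ + \underset{(.0010)}{.0036}\, IQ$$

$$N = 935, \quad R^2 = .263$$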


Notice how the return to schooling has fallen from about 6.5 percent to about 5.4
percent when $IQ$ is added to the regression. This is what we expect to happen if
ability and schooling are (partially) positively correlated. Of course, these are just
the findings from one sample. Adding $IQ$ explains only one percentage point more of
the variation in $\log(wage)$, and the equation predicts that 15 more $IQ$ points (one
standard deviation) increases wage by about 5.4 percent. The standard error on the
return to education has increased, but the 95 percent confidence interval is still fairly
tight.
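
The two numerical statements in this example follow from simple arithmetic on the reported estimates. The sketch below reproduces them; it uses only the coefficient and standard error printed above, and it assumes the usual asymptotic normal critical value of 1.96 for the confidence interval.

```python
# Figures reported in Example 4.3 (equation with IQ included).
coef_iq, coef_educ, se_educ = 0.0036, 0.054, 0.007

# Predicted proportionate change in wage from 15 more IQ points.
iq_effect = 15 * coef_iq

# Asymptotic 95 percent confidence interval for the return to education,
# using the standard normal critical value 1.96.
ci_low = coef_educ - 1.96 * se_educ
ci_high = coef_educ + 1.96 * se_educ

print(f"15 IQ points: about {100 * iq_effect:.1f} percent")
print(f"95% CI for return to education: [{ci_low:.4f}, {ci_high:.4f}]")
```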


Often the outcome of the dependent variable from an earlier time period can be a 
useful proxy variable. 


Example 4.4 (Effects of Job Training Grants on Worker Productivity): The data in 
JTRAIN1.RAW are for 157 Michigan manufacturing firms for the years 1987, 1988,
and 1989. These data are from Holzer, Block, Cheatham, and Knott (1993). The goal 
is to determine the effectiveness of job training grants on firm productivity. For this 
exercise, we use only the 54 firms in 1988 that reported nonmissing values of the 
scrap rate (number of items out of 100 that must be scrapped). No firms were 
awarded grants in 1987; in 1988, 19 of the 54 firms were awarded grants. If the 
training grant has the intended effect, the average scrap rate should be lower among 
firms receiving a grant. The problem is that the grants were not randomly assigned:
whether or not a firm received a grant could be related to other factors unobservable 
to the econometrician that affect productivity. In the simplest case, we can write (for 
the 1988 cross section) 


$$\log(scrap) = \beta_0 + \beta_1 grant + \gamma q + v,$$


where v is orthogonal to grant but q contains unobserved productivity factors that 
might be correlated with grant, a binary variable equal to unity if the firm received a 
job training grant. Since we have the scrap rate in the previous year, we can use
$\log(scrap_{-1})$ as a proxy variable for $q$:


$$q = \theta_0 + \theta_1 \log(scrap_{-1}) + r,$$


where $r$ has zero mean and, by definition, is uncorrelated with $\log(scrap_{-1})$. We
hope that $r$ has no or little correlation with $grant$. Plugging in for $q$ gives the estimable
model


$$\log(scrap) = \delta_0 + \beta_1 grant + \gamma\theta_1 \log(scrap_{-1}) + \gamma r + v.$$


From this equation, we see that $\beta_1$ measures the proportionate difference in scrap
rates for two firms having the same scrap rates in the previous year, but where one 
firm received a grant and the other did not. This is intuitively appealing. The esti- 
mated equations are 


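$$\widehat{\log(scrap)} = \underset{(.240)}{.409} + \underset{(.406)}{.057}\, grant$$

$$N = 54, \quad R^2 = .0004$$


$$\widehat{\log(scrap)} = \underset{(.089)}{.021} - \underset{(.147)}{.254}\, grant + \underset{(.044)}{.831}\, \log(scrap_{-1})$$

$$N = 54, \quad R^2 = .873$$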


Without the lagged scrap rate, we see that the grant appears, if anything, to reduce 
productivity (by increasing the scrap rate), although the coefficient is statistically in- 
significant. When the lagged dependent variable is included, the coefficient on grant 
changes signs, becomes economically large—firms awarded grants have scrap rates 
about 25.4 percent less than those not given grants—and the effect is significant at the 
5 percent level against a one-sided alternative. (The more accurate estimate of the
percentage effect is $100 \cdot [\exp(-.254) - 1] = -22.4\%$; see Problem 4.1(a).)
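
The percentage effect and the significance claim can both be reproduced from the reported coefficient and standard error. The sketch below does the arithmetic; the comparison value of roughly $-1.68$ for a one-sided 5 percent test with 51 degrees of freedom is taken from standard $t$ tables and is an approximation, not something computed here.

```python
import math

# Figures reported in Example 4.4 (equation with the lagged scrap rate).
coef_grant, se_grant = -0.254, 0.147

# The coefficient times 100 only approximates the percentage change;
# the exact change implied by a log-level model is 100*[exp(b) - 1].
approx_pct = 100 * coef_grant
exact_pct = 100 * (math.exp(coef_grant) - 1)

# t statistic for H0: beta1 = 0 against H1: beta1 < 0.
t_stat = coef_grant / se_grant

print(f"approximate effect: {approx_pct:.1f} percent")
print(f"exact effect:       {exact_pct:.1f} percent")
print(f"t statistic:        {t_stat:.2f}")
```
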

We can always use more than one proxy for $q$. For example, it might be that
$E(q \mid \mathbf{x}, z_1, z_2) = E(q \mid z_1, z_2) = \theta_0 + \theta_1 z_1 + \theta_2 z_2$, in which case including both $z_1$ and
$z_2$ as regressors along with $x_1, \ldots, x_K$ solves the omitted variable problem. The
weaker condition that the error $r$ in the equation $q = \theta_0 + \theta_1 z_1 + \theta_2 z_2 + r$ is uncorrelated
with $x_1, \ldots, x_K$ also suffices.

The data set NLS80.RAW also contains each man’s score on the knowledge of 
the world of work (KWW) test. Problem 4.11 asks you to reestimate equation (4.29) 
when KWW and IQ are both used as proxies for ability. 

Before ending this subsection, it is useful to formally show how not all variables
are suitable proxy variables, in the sense that adding them to a regression can actually
leave us worse off than excluding them. Intuitively, variables that have low correlation
with the omitted variable make poor proxies. For illustration, consider a
simple regression model and the extreme case where a proposed proxy variable is
uncorrelated with the error term:


$$y = \beta_0 + \beta_1 x + u$$
$$E(u) = 0, \quad \mathrm{Cov}(z, u) = 0,$$


where all quantities are scalars and $u$ includes the omitted variable. Let $\tilde{\beta}_1$ denote the
OLS slope estimator from regressing $y$ on $1, x$ and let $\hat{\beta}_1$ be the slope on $x$ from
the regression $y$ on $1, x, z$ (using a random sample of size $N$ in each case). From the
omitted variable inconsistency formula (4.24), we know that $\mathrm{plim}(\tilde{\beta}_1) = \beta_1 +
\mathrm{Cov}(x, u)/\mathrm{Var}(x)$. Further, from the two-step projection result (Property LP.7 in
Chapter 2), we can easily find the plim of $\hat{\beta}_1$. Let $a = x - L(x \mid 1, z) = x - \pi_0 - \pi_1 z$
be the population residual from linearly projecting $x$ onto $z$. Then $\mathrm{plim}(\hat{\beta}_1) =
\beta_1 + \mathrm{Cov}(a, u)/\mathrm{Var}(a)$. Therefore, the absolute values of the inconsistencies are
$|\mathrm{Cov}(x, u)|/\mathrm{Var}(x)$ and $|\mathrm{Cov}(a, u)|/\mathrm{Var}(a)$, respectively. Now, because $z$ is uncorrelated
with $u$, $\mathrm{Cov}(a, u) = \mathrm{Cov}(x, u)$, and so the numerators in the inconsistency terms
are the same. Further, by a standard property of a population residual, $\mathrm{Var}(a) \le
\mathrm{Var}(x)$, with strict inequality unless $z$ is also uncorrelated with $x$. We have shown
that $|\mathrm{plim}(\hat{\beta}_1) - \beta_1| > |\mathrm{plim}(\tilde{\beta}_1) - \beta_1|$ whenever $\mathrm{Cov}(z, u) = 0$ and $\mathrm{Cov}(z, x) \neq 0$. In
other words, OLS without the proxy has less inconsistency than OLS with the proxy
(unless $x$ is uncorrelated with $u$, too). The proxy variable estimator also has a larger
asymptotic variance.
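
The comparison of the two inconsistencies can be illustrated with simulated data. The sketch below is one hypothetical design in which $z$ is uncorrelated with $u$ but correlated with $x$; the coefficients 1 and 0.5 in the equation for $x$ are assumptions chosen so that the population quantities are easy to compute by hand.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 200_000
beta0, beta1 = 1.0, 2.0        # hypothetical population values

z = rng.normal(size=n)         # proposed "proxy": uncorrelated with u
u = rng.normal(size=n)         # error term containing the omitted variable
x = z + 0.5 * u + rng.normal(size=n)   # x is correlated with both z and u
y = beta0 + beta1 * x + u

def ols(y, *regressors):
    """OLS coefficients from regressing y on a constant and the regressors."""
    X = np.column_stack([np.ones(len(y)), *regressors])
    return np.linalg.lstsq(X, y, rcond=None)[0]

b_tilde = ols(y, x)[1]         # slope from y on 1, x
b_hat = ols(y, x, z)[1]        # slope on x from y on 1, x, z

# Population inconsistencies implied by this design:
#   Cov(x,u)/Var(x) = 0.5/2.25 and Cov(a,u)/Var(a) = 0.5/1.25,
# where a = x - z is the residual from projecting x on (1, z).
bias_without = 0.5 / 2.25
bias_with = 0.5 / 1.25

print("inconsistency without z:", b_tilde - beta1, "predicted", bias_without)
print("inconsistency with z:   ", b_hat - beta1, "predicted", bias_with)
```

Adding $z$ leaves the numerator of the inconsistency unchanged but shrinks the denominator from $\mathrm{Var}(x)$ to $\mathrm{Var}(a)$, which is why the second estimate is further from $\beta_1$.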

As we will see in Chapter 5, variables uncorrelated with $u$ and correlated with $x$ are
very useful for identifying $\beta_1$, but they are not used as additional regressors in an
OLS regression.


4.3.3 Models with Interactions in Unobservables: Random Coefficient Models


In some cases we might be concerned about interactions between unobservables and 
observable explanatory variables. Obtaining consistent estimators is more difficult in 
this case, but a good proxy variable can again solve the problem. 

Write the structural model with unobservable q as 


$$y = \beta_0 + \beta_1 x_1 + \cdots + \beta_K x_K + \gamma_1 q + \gamma_2 x_K q + v, \tag{4.30}$$

where we make a zero conditional mean assumption on the structural error $v$:

$$E(v \mid \mathbf{x}, q) = 0. \tag{4.31}$$


For simplicity we have interacted $q$ with only one explanatory variable, $x_K$.

Before discussing estimation of equation (4.30), we should have an interpretation
for the parameters in this equation, as the interaction $x_K q$ is unobservable. (We discussed
this topic more generally in Section 2.2.5.) If $x_K$ is an essentially continuous
variable, the partial effect of $x_K$ on $E(y \mid \mathbf{x}, q)$ is


$$\frac{\partial E(y \mid \mathbf{x}, q)}{\partial x_K} = \beta_K + \gamma_2 q. \tag{4.32}$$


Thus, the partial effect of $x_K$ actually depends on the level of $q$. Because $q$ is not
observed for anyone in the population, equation (4.32) can never be estimated, even
if we could estimate $\gamma_2$ (which we cannot, in general). But we can average equation
(4.32) across the population distribution of $q$. Assuming $E(q) = 0$, the average partial
effect (APE) of $x_K$ is


$$E(\beta_K + \gamma_2 q) = \beta_K. \tag{4.33}$$


A similar interpretation holds for discrete $x_K$. For example, if $x_K$ is binary, then
$E(y \mid x_1, \ldots, x_{K-1}, 1, q) - E(y \mid x_1, \ldots, x_{K-1}, 0, q) = \beta_K + \gamma_2 q$, and $\beta_K$ is the average
of this difference over the distribution of $q$. In this case, $\beta_K$ is called the average
treatment effect (ATE). This name derives from the case where $x_K$ represents receiving
some “treatment,” such as participation in a job training program or participation
in a school voucher program. We will consider the binary treatment case
further in Chapter 19, where we introduce a counterfactual framework for estimating
average treatment effects.

It turns out that the assumption $E(q) = 0$ is without loss of generality. Using simple
algebra we can show that, if $\mu_q \equiv E(q) \neq 0$, then we can consistently estimate
$\beta_K + \gamma_2 \mu_q$, which is the average partial effect.

The model in equation (4.30) is sometimes called a random coefficient model
because (in this case) one of the slope coefficients is “random”; that is, it depends
on individual-specific unobserved heterogeneity. One way to write the equation for a
random draw $i$ is $y_i = \beta_0 + \beta_1 x_{i1} + \cdots + \beta_{K-1} x_{i,K-1} + b_{iK} x_{iK} + u_i$, where $b_{iK} = \beta_K +
\gamma_2 q_i$ is the random coefficient. Note that all other slope coefficients are assumed to be
constant. Regardless of what we label the model, we are typically interested in estimating
$\beta_K = E(b_{iK})$.

If the elements of $\mathbf{x}$ are exogenous in the sense that $E(q \mid \mathbf{x}) = 0$, then we can consistently
estimate each of the $\beta_j$ by an OLS regression, where $q$ and $x_K q$ are just part
of the error term. This result follows from iterated expectations applied to equation
(4.30), which shows that $E(y \mid \mathbf{x}) = \beta_0 + \beta_1 x_1 + \cdots + \beta_K x_K$ if $E(q \mid \mathbf{x}) = 0$. The
resulting equation probably has heteroskedasticity, but this is easily dealt with. Incidentally,
this is a case where only assuming that $q$ and $\mathbf{x}$ are uncorrelated would not
be enough to ensure consistency of OLS: $x_K q$ and $\mathbf{x}$ can be correlated even if $q$ and $\mathbf{x}$
are uncorrelated.

If $q$ and $\mathbf{x}$ are correlated, we can consistently estimate the $\beta_j$ by OLS if we have a
suitable proxy variable for $q$. We still assume that the proxy variable, $z$, satisfies the
redundancy condition (4.25). In the current model we must make a stronger proxy
variable assumption than we did in Section 4.3.2:


$$E(q \mid \mathbf{x}, z) = E(q \mid z) = \theta_1 z, \tag{4.34}$$


where now we assume $z$ has a zero mean in the population. Under these two proxy
variable assumptions, iterated expectations gives


$$E(y \mid \mathbf{x}, z) = \beta_0 + \beta_1 x_1 + \cdots + \beta_K x_K + \gamma_1\theta_1 z + \gamma_2\theta_1 x_K z, \tag{4.35}$$


and the parameters are consistently estimated by OLS. 

If we do not define our proxy to have zero mean in the population, then estimating
equation (4.35) by OLS does not consistently estimate $\beta_K$. If $E(z) \neq 0$, then we would
have to write $E(q \mid z) = \theta_0 + \theta_1 z$, in which case the coefficient on $x_K$ in equation
(4.35) would be $\beta_K + \theta_0\gamma_2$. In practice, we may not know the population mean of the
proxy variable, in which case the proxy variable should be demeaned in the sample
before interacting it with $x_K$.
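
The role of demeaning can be seen in a simulation. The sketch below is a hypothetical design with a single regressor $x_K$ and a proxy whose population mean is 100; all parameter values are illustrative assumptions, and the helper `ols` is defined here only for this example.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 200_000

# Hypothetical population values (illustrative assumptions).
beta0, betaK = 1.0, 1.0
gamma1, gamma2 = 0.5, 0.3
theta1, mu_z = 0.02, 100.0            # E(q | x, z) = theta1 * (z - mu_z)

z = rng.normal(mu_z, 15.0, n)         # proxy with a nonzero population mean
q = theta1 * (z - mu_z) + rng.normal(0.0, 0.5, n)   # E(q) = 0
xK = 0.05 * (z - mu_z) + rng.normal(size=n)         # xK correlated with q via z
v = rng.normal(0.0, 0.5, n)
y = beta0 + betaK * xK + gamma1 * q + gamma2 * xK * q + v

def ols(y, *regressors):
    """OLS coefficients from regressing y on a constant and the regressors."""
    X = np.column_stack([np.ones(len(y)), *regressors])
    return np.linalg.lstsq(X, y, rcond=None)[0]

z0 = z - z.mean()                     # demean the proxy in the sample
b_demeaned = ols(y, xK, z0, xK * z0)
b_raw = ols(y, xK, z, xK * z)         # proxy not demeaned

theta0 = -theta1 * mu_z               # intercept in E(q | z) = theta0 + theta1*z
print("coef on xK, demeaned proxy:", b_demeaned[1], "(betaK =", betaK, ")")
print("coef on xK, raw proxy:     ", b_raw[1],
      "(betaK + theta0*gamma2 =", betaK + theta0 * gamma2, ")")
```

Both regressions fit the data equally well; they differ only in what the coefficient on $x_K$ measures. With the demeaned proxy it estimates the average partial effect $\beta_K$, and with the raw proxy it estimates $\beta_K + \theta_0\gamma_2$.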

If we maintain homoskedasticity in the structural model (that is, $\mathrm{Var}(y \mid \mathbf{x}, q, z) =
\mathrm{Var}(y \mid \mathbf{x}, q) = \sigma^2$), then there must be heteroskedasticity in $\mathrm{Var}(y \mid \mathbf{x}, z)$. Using
Property CV.3 in Appendix 2A, it can be shown that


$$\mathrm{Var}(y \mid \mathbf{x}, z) = \sigma^2 + (\gamma_1 + \gamma_2 x_K)^2\, \mathrm{Var}(q \mid \mathbf{x}, z).$$

Even if $\mathrm{Var}(q \mid \mathbf{x}, z)$ is constant, $\mathrm{Var}(y \mid \mathbf{x}, z)$ depends on $x_K$. This situation is most
easily dealt with by computing heteroskedasticity-robust statistics, which allows for
heteroskedasticity of arbitrary form.
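
To illustrate why robust statistics matter here, the sketch below simulates an error with exactly this variance structure and compares the usual OLS standard errors with White's heteroskedasticity-robust (HC0) standard errors. It is a minimal hand-rolled computation under assumed parameter values, not a substitute for the robust routines in a statistics package.

```python
import numpy as np

rng = np.random.default_rng(4)
n = 100_000
gamma1, gamma2 = 0.5, 1.0               # hypothetical values

xK = rng.normal(size=n)
r = rng.normal(size=n)                  # part of q not explained by the proxy
v = rng.normal(size=n)
# Composite error (gamma1 + gamma2*xK)*r + v has conditional variance
# 1 + (gamma1 + gamma2*xK)^2, which depends on xK.
y = 1.0 + 2.0 * xK + (gamma1 + gamma2 * xK) * r + v

X = np.column_stack([np.ones(n), xK])
XtX_inv = np.linalg.inv(X.T @ X)
b = XtX_inv @ X.T @ y
resid = y - X @ b

# Usual OLS variance estimate: sigma^2 * (X'X)^{-1}.
sigma2 = resid @ resid / (n - X.shape[1])
se_usual = np.sqrt(np.diag(sigma2 * XtX_inv))

# White (HC0) estimate: (X'X)^{-1} [sum_i u_i^2 x_i' x_i] (X'X)^{-1}.
meat = (X * resid[:, None] ** 2).T @ X
se_robust = np.sqrt(np.diag(XtX_inv @ meat @ XtX_inv))

print("slope estimate:   ", b[1])
print("usual SE (slope): ", se_usual[1])
print("robust SE (slope):", se_robust[1])
```

In this design the slope estimate is still consistent, but the usual standard error understates the sampling variation: the population ratio of robust to usual standard errors for the slope is $\sqrt{4.25/2.25} \approx 1.37$.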


Example 4.5 (Return to Education Depends on Ability): Consider an extension of 
the wage equation (4.29): 


$$\log(wage) = \beta_0 + \beta_1 exper + \beta_2 tenure + \beta_3 married + \beta_4 south + \beta_5 urban + \beta_6 black + \beta_7 educ + \gamma_1 abil + \gamma_2\, educ \cdot abil + v \tag{4.36}$$


so that $educ$ and $abil$ have separate effects but also have an interactive effect. In this
model the return to a year of schooling depends on $abil$: $\beta_7 + \gamma_2 abil$. Normalizing $abil$
to have zero population mean, we see that the average of the return to education is
simply $\beta_7$. We estimate this equation under the assumption that $IQ$ is redundant
in equation (4.36) and $E(abil \mid \mathbf{x}, IQ) = E(abil \mid IQ) = \theta_1(IQ - 100) \equiv \theta_1 IQ_0$, where
$IQ_0$ is the population-demeaned $IQ$ ($IQ$ is constructed to have mean 100 in the population).
We can estimate the $\beta_j$ in equation (4.36) by replacing $abil$ with $IQ_0$ and
$educ \cdot abil$ with $educ \cdot IQ_0$ and doing OLS.

Using the sample of men in NLS80.RAW gives the following:


$$\widehat{\log(wage)} = \cdots + \underset{(.007)}{.052}\, educ - \underset{(.00516)}{.00094}\, IQ_0 + \underset{(.00038)}{.00034}\, educ \cdot IQ_0$$

$$N = 935, \quad R^2 = .263$$


where the usual OLS standard errors are reported (if $\gamma_2 = 0$, homoskedasticity may
be reasonable). The interaction term $educ \cdot IQ_0$ is not statistically significant, and the
return to education at the average IQ, 5.2 percent, is similar to the estimate when the
return to education is assumed to be constant. Thus there is little evidence for an interaction
between education and ability. Incidentally, the $F$ test for joint significance
of $IQ_0$ and $educ \cdot IQ_0$ yields a p-value of about .0011, but the interaction term is not
needed.


In this case, we happen to know the population mean of $IQ$, but in most cases we
will not know the population mean of a proxy variable. Then, we should use the
sample average to demean the proxy before interacting it with $x_K$; see Problem 4.8.
Technically, using the sample average to estimate the population average should be 
reflected in the OLS standard errors. But, as you are asked to show in Problem 6.10 
in Chapter 6, the adjustments generally have very small impacts on the standard 
errors and can safely be ignored. 

In his study on the effects of computer usage on the wage structure in the United
States, Krueger (1993) uses computer usage at home as a proxy for unobservables 
that might be correlated with computer usage at work; he also includes an interaction 
between the two computer usage dummies. Krueger does not demean the “uses 
computer at home” dummy before constructing the interaction, so his estimate on 
“uses a computer at work” does not have an average treatment effect interpreta- 
tion. However, just as in Example 4.5, Krueger found that the interaction term is 
insignificant. 


4.4 Properties of Ordinary Least Squares under Measurement Error 


As we saw in Section 4.1, another way that endogenous explanatory variables can 
arise in economic applications occurs when one or more of the variables in our model 
contains measurement error. In this section, we derive the consequences of measure- 
ment error for ordinary least squares estimation. 

The measurement error problem has a statistical structure similar to the omitted 
variable-proxy variable problem discussed in the previous section. However, they are
conceptually very different. In the proxy variable case, we are looking for a variable 
that is somehow associated with the unobserved variable. In the measurement error 
case, the variable that we do not observe has a well-defined, quantitative meaning 
(such as a marginal tax rate or annual income), but our measures of it may contain 
error. For example, reported annual income is a measure of actual annual income, 
whereas IQ score is a proxy for ability. 

Another important difference between the proxy variable and measurement error 
problems is that, in the latter case, often the mismeasured explanatory variable is the 
one whose effect is of primary interest. In the proxy variable case, we cannot estimate 
the effect of the omitted variable. 

Before we turn to the analysis, it is important to remember that measurement error 
is an issue only when the variables on which we can collect data differ from the vari- 
ables that influence decisions by individuals, families, firms, and so on. For example, 
suppose we are estimating the effect of peer group behavior on teenage drug usage, 
where the behavior of one’s peer group is self-reported. Self-reporting may be a mis- 
measure of actual peer group behavior, but so what? We are probably more inter- 
ested in the effects of how a teenager perceives his or her peer group. 


4.4.1 Measurement Error in the Dependent Variable 


We begin with the case where the dependent variable is the only variable measured 
with error. Let y* denote the variable (in the population, as always) that we would 
like to explain. For example, y* could be annual family saving. The regression model
has the usual linear form 


$$y^* = \beta_0 + \beta_1 x_1 + \cdots + \beta_K x_K + v \tag{4.37}$$


and we assume that it satisfies at least Assumptions OLS.1 and OLS.2. Typically, we
are interested in $E(y^* \mid x_1, \ldots, x_K)$. We let $y$ represent the observable measure of $y^*$
where $y \neq y^*$.

The population measurement error is defined as the difference between the ob- 
served value and the actual value: 


$$e_0 = y - y^*. \tag{4.38}$$


For a random draw $i$ from the population, we can write $e_{i0} = y_i - y_i^*$, but what is
important is how the measurement error in the population is related to other factors.
To obtain an estimable model, we write $y^* = y - e_0$, plug this into equation (4.37),
and rearrange:


$$y = \beta_0 + \beta_1 x_1 + \cdots + \beta_K x_K + v + e_0. \tag{4.39}$$


Since $y, x_1, x_2, \ldots, x_K$ are observed, we can estimate this model by OLS. In effect, we
just ignore the fact that $y$ is an imperfect measure of $y^*$ and proceed as usual.

When does OLS with $y$ in place of $y^*$ produce consistent estimators of the $\beta_j$?
Since the original model (4.37) satisfies Assumption OLS.1, $v$ has zero mean and is
uncorrelated with each $x_j$. It is only natural to assume that the measurement error
has zero mean; if it does not, this fact only affects estimation of the intercept, $\beta_0$.
Much more important is what we assume about the relationship between the measurement
error $e_0$ and the explanatory variables $x_j$. The usual assumption is that
the measurement error in $y$ is statistically independent of each explanatory variable,
which implies that $e_0$ is uncorrelated with $\mathbf{x}$. Then, the OLS estimators from equation
(4.39) are consistent (and possibly unbiased as well). Further, the usual OLS inference
procedures ($t$ statistics, $F$ statistics, $LM$ statistics) are asymptotically valid under
appropriate homoskedasticity assumptions.

If $e_0$ and $v$ are uncorrelated, as is usually assumed, then $\mathrm{Var}(v + e_0) = \sigma_v^2 + \sigma_0^2 >
\sigma_v^2$. Therefore, measurement error in the dependent variable results in a larger
error variance than when the dependent variable is not measured with error. This 
result is hardly surprising and translates into larger asymptotic variances for the 
OLS estimators than if we could observe y*. But the larger error variance violates 
none of the assumptions needed for OLS estimation to have its desirable large-sample 
properties. 
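
Both conclusions (consistency and a larger error variance) can be checked in a simulation. The sketch below uses hypothetical parameter values and a measurement error drawn independently of the regressor; the helper `ols_with_error_variance` is defined here only for illustration.

```python
import numpy as np

rng = np.random.default_rng(5)
n = 200_000
beta0, beta1 = 1.0, 2.0                 # hypothetical population values
sigma_v, sigma_0 = 1.0, 2.0             # sd of v and of the measurement error e0

x = rng.normal(size=n)
v = rng.normal(0.0, sigma_v, n)
e0 = rng.normal(0.0, sigma_0, n)        # independent of x and v
y_star = beta0 + beta1 * x + v          # true dependent variable, as in (4.37)
y = y_star + e0                         # observed, mismeasured dependent variable

X = np.column_stack([np.ones(n), x])

def ols_with_error_variance(dep):
    """Return OLS coefficients and the estimated error variance."""
    b = np.linalg.lstsq(X, dep, rcond=None)[0]
    resid = dep - X @ b
    return b, resid @ resid / (n - X.shape[1])

b_star, s2_star = ols_with_error_variance(y_star)
b_obs, s2_obs = ols_with_error_variance(y)

print("slope using y*:", b_star[1], " error variance:", s2_star)
print("slope using y: ", b_obs[1], " error variance:", s2_obs)
```

The slope is close to $\beta_1$ in both regressions, while the estimated error variance rises from about $\sigma_v^2$ to about $\sigma_v^2 + \sigma_0^2$.
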


Example 4.6 (Saving Function with Measurement Error): Consider a saving function


$$E(sav^* \mid inc, size, educ, age) = \beta_0 + \beta_1 inc + \beta_2 size + \beta_3 educ + \beta_4 age$$


but where actual saving (sav*) may deviate from reported saving (sav). The question 
is whether the measurement error in sav is systematically related to the other vari- 
ables. It may be reasonable to assume that the measurement error is not correlated 
with inc, size, educ, and age, but we might expect that families with higher incomes or 
more education report their saving more accurately. Unfortunately, without more 
information, we cannot know whether the measurement error is correlated with inc 
or educ. 


When the dependent variable is in logarithmic form, so that log(y*) is the depen- 
dent variable, a natural measurement error equation is 


log(y) = log(y*) + eo. (4.40) 


This follows from a multiplicative measurement error for $y$: $y = y^* a_0$ where $a_0 > 0$
and $e_0 = \log(a_0)$.


Example 4.7 (Measurement Error in Firm Scrap Rates): In Example 4.4, we might
think that the firm scrap rate is mismeasured, leading us to postulate the model
$\log(scrap^*) = \beta_0 + \beta_1 grant + v$, where $scrap^*$ is the true scrap rate. The measurement
error equation is $\log(scrap) = \log(scrap^*) + e_0$. Is the measurement error $e_0$ independent
of whether the firm receives a grant? Not if a firm receiving a grant is more
likely to underreport its scrap rate in order to make it look as if the grant had the
intended effect. If underreporting occurs, then, in the estimable equation $\log(scrap) =
\beta_0 + \beta_1 grant + v + e_0$, the error $u = v + e_0$ is negatively correlated with $grant$. This
result would produce a downward bias in $\hat{\beta}_1$, tending to make the training program
look more effective than it actually was.
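
The direction of this bias can be seen in simulated data. The sketch below is purely hypothetical: the true grant effect, the share of firms receiving grants, and the amount of underreporting are all assumed values, not estimates from JTRAIN1.RAW.

```python
import numpy as np

rng = np.random.default_rng(6)
n = 200_000
beta0, beta1 = 0.5, -0.10        # hypothetical true effect of the grant
under = 0.20                     # hypothetical underreporting by grant recipients (log points)

grant = (rng.uniform(size=n) < 0.35).astype(float)
v = rng.normal(size=n)
log_scrap_star = beta0 + beta1 * grant + v          # true log scrap rate

# Measurement error whose mean depends on grant, so Cov(grant, e0) < 0.
e0 = -under * grant + rng.normal(0.0, 0.3, n)
log_scrap = log_scrap_star + e0                     # reported log scrap rate

X = np.column_stack([np.ones(n), grant])
b_true = np.linalg.lstsq(X, log_scrap_star, rcond=None)[0]
b_obs = np.linalg.lstsq(X, log_scrap, rcond=None)[0]

print("grant coefficient, true scrap rate:    ", b_true[1])
print("grant coefficient, reported scrap rate:", b_obs[1])
```

With the reported scrap rate the grant coefficient converges to $\beta_1$ minus the amount of underreporting, which overstates the reduction in scrap attributable to the grant.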


These examples show that measurement error in the dependent variable can cause 
biases in OLS if the measurement error is systematically related to one or more of the 
explanatory variables. If the measurement error is uncorrelated with the explanatory 
variables, OLS is perfectly appropriate. 


4.4.2 Measurement Error in an Explanatory Variable 


Traditionally, measurement error in an explanatory variable has been considered a 
much more important problem than measurement error in the response variable. This 
point was suggested by Example 4.2, and in this subsection we develop the general 
case. 

We consider the model with a single explanatory variable measured with error:


$$y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_K x_K^* + v, \tag{4.41}$$


where $y, x_1, \ldots, x_{K-1}$ are observable but $x_K^*$ is not. We assume at a minimum that
$v$ has zero mean and is uncorrelated with $x_1, x_2, \ldots, x_{K-1}, x_K^*$; in fact, we usually
have in mind the structural model $E(y \mid x_1, \ldots, x_{K-1}, x_K^*) = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots +
\beta_K x_K^*$. If $x_K^*$ were observed, OLS estimation would produce consistent estimators.
Instead, we have a measure of $x_K^*$; call it $x_K$. A maintained assumption is that $v$
is also uncorrelated with $x_K$. This follows under the redundancy assumption
$E(y \mid x_1, \ldots, x_{K-1}, x_K^*, x_K) = E(y \mid x_1, \ldots, x_{K-1}, x_K^*)$, an assumption we used in the
proxy variable solution to the omitted variable problem. This means that $x_K$ has
no effect on $y$ once the other explanatory variables, including $x_K^*$, have been controlled
for. Because $x_K^*$ is assumed to be the variable that affects $y$, this assumption is
uncontroversial.

The measurement error in the population is simply


$$e_K = x_K - x_K^*, \tag{4.42}$$


and this can be positive, negative, or zero. We assume that the average measurement
error in the population is zero: $E(e_K) = 0$, which has no practical consequences because
we include an intercept in equation (4.41). Since $v$ is assumed to be uncorrelated
with $x_K^*$ and $x_K$, $v$ is also uncorrelated with $e_K$.

We want to know the properties of OLS if we simply replace $x_K^*$ with $x_K$ and run
the regression of $y$ on $1, x_1, x_2, \ldots, x_K$. These depend crucially on the assumptions we
make about the measurement error. An assumption that is almost always maintained
is that $e_K$ is uncorrelated with the explanatory variables not measured with error:
$E(x_j e_K) = 0$, $j = 1, \ldots, K-1$.

The key assumptions involve the relationship between the measurement error and
$x_K^*$ and $x_K$. Two assumptions have been the focus in the econometrics literature, and
these represent polar extremes. The first assumption is that $e_K$ is uncorrelated with
the observed measure, $x_K$:


$$\mathrm{Cov}(x_K, e_K) = 0. \tag{4.43}$$


From equation (4.42), if assumption (4.43) is true, then $e_K$ must be correlated with
the unobserved variable $x_K^*$. To determine the properties of OLS in this case, we write
$x_K^* = x_K - e_K$ and plug this into equation (4.41):


$$y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_K x_K + (v - \beta_K e_K). \tag{4.44}$$

Now, we have assumed that $v$ and $e_K$ both have zero mean and are uncorrelated with
each $x_j$, including $x_K$; therefore, $v - \beta_K e_K$ has zero mean and is uncorrelated with the
$x_j$. It follows that OLS estimation with $x_K$ in place of $x_K^*$ produces consistent estimators
of all of the $\beta_j$ (assuming the standard rank condition Assumption OLS.2).
Since $v$ is uncorrelated with $e_K$, the variance of the error in equation (4.44) is
$\mathrm{Var}(v - \beta_K e_K) = \sigma_v^2 + \beta_K^2 \sigma_{e_K}^2$. Therefore, except when $\beta_K = 0$, measurement error
increases the error variance, which is not a surprising finding and violates none of the
OLS assumptions.
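
A short simulation illustrates this first polar case. In the sketch below the observed measure is drawn first and the measurement error is drawn independently of it, so assumption (4.43) holds by construction; all parameter values are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(7)
n = 200_000
beta0, beta1 = 1.0, 2.0                 # hypothetical population values
sigma_v, sigma_e = 1.0, 0.5

x_obs = rng.normal(size=n)              # observed measure x_K
e = rng.normal(0.0, sigma_e, n)         # measurement error, independent of x_obs: (4.43)
x_star = x_obs - e                      # true variable, so Cov(x_star, e) = -sigma_e^2
v = rng.normal(0.0, sigma_v, n)
y = beta0 + beta1 * x_star + v

X = np.column_stack([np.ones(n), x_obs])
b = np.linalg.lstsq(X, y, rcond=None)[0]
resid = y - X @ b
s2 = resid @ resid / (n - 2)            # estimated error variance

print("slope on observed measure:", b[1], "(beta1 =", beta1, ")")
print("error variance:", s2, "(sigma_v^2 + beta1^2*sigma_e^2 =",
      sigma_v**2 + beta1**2 * sigma_e**2, ")")
```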

The assumption that $e_K$ is uncorrelated with $x_K$ is analogous to the proxy variable
assumption we made in Section 4.3.2. Since this assumption implies that OLS has
all its nice properties, this is not usually what econometricians have in mind when
referring to measurement error in an explanatory variable. The classical errors-in-variables
(CEV) assumption replaces assumption (4.43) with the assumption that the
measurement error is uncorrelated with the unobserved explanatory variable:


$$\mathrm{Cov}(x_K^*, e_K) = 0. \tag{4.45}$$


This assumption comes from writing the observed measure as the sum of the true
explanatory variable and the measurement error, $x_K = x_K^* + e_K$, and then assuming
the two components of $x_K$ are uncorrelated. (This has nothing to do with assumptions
about $v$; we are always maintaining that $v$ is uncorrelated with $x_K^*$ and $x_K$, and
therefore with $e_K$.)

If assumption (4.45) holds, then $x_K$ and $e_K$ must be correlated:


$$\mathrm{Cov}(x_K, e_K) = E(x_K e_K) = E(x_K^* e_K) + E(e_K^2) = \sigma_{e_K}^2. \tag{4.46}$$


Thus, under the CEV assumption, the covariance between $x_K$ and $e_K$ is equal to the
variance of the measurement error.

Looking at equation (4.44), we see that correlation between $x_K$ and $e_K$ causes
problems for OLS. Because $v$ and $x_K$ are uncorrelated, the covariance between
$x_K$ and the composite error $v - \beta_K e_K$ is $\mathrm{Cov}(x_K, v - \beta_K e_K) = -\beta_K \mathrm{Cov}(x_K, e_K) =
-\beta_K \sigma_{e_K}^2$. It follows that, in the CEV case, the OLS regression of $y$ on $x_1, x_2, \ldots, x_K$
generally gives inconsistent estimators of all of the $\beta_j$.

The plims of the $\hat{\beta}_j$ for $j \neq K$ are difficult to characterize except under special
assumptions. If $x_K^*$ is uncorrelated with $x_j$, all $j \neq K$, then so is $x_K$, and it follows that
$\mathrm{plim}\, \hat{\beta}_j = \beta_j$, all $j \neq K$. The plim of $\hat{\beta}_K$ can be characterized in any case. Problem
4.10 asks you to show that


$$\mathrm{plim}(\hat{\beta}_K) = \beta_K \left( \frac{\sigma_{r_K^*}^2}{\sigma_{r_K^*}^2 + \sigma_{e_K}^2} \right), \tag{4.47}$$

where $r_K^*$ is the linear projection error in

$$x_K^* = \delta_0 + \delta_1 x_1 + \delta_2 x_2 + \cdots + \delta_{K-1} x_{K-1} + r_K^*.$$


An important implication of equation (4.47) is that, because the term multiplying $\beta_K$
is always between zero and one, $|\mathrm{plim}(\hat{\beta}_K)| < |\beta_K|$. This is called the attenuation bias
in OLS due to CEVs: on average (or in large samples), the estimated OLS effect will
be attenuated as a result of the presence of CEVs. If $\beta_K$ is positive, $\hat{\beta}_K$ will tend to
underestimate $\beta_K$; if $\beta_K$ is negative, $\hat{\beta}_K$ will tend to overestimate $\beta_K$.

In the case of a single explanatory variable ($K = 1$) measured with error, equation
(4.47) becomes


$$\mathrm{plim}\, \hat{\beta}_1 = \beta_1 \left( \frac{\sigma_{x_1^*}^2}{\sigma_{x_1^*}^2 + \sigma_{e_1}^2} \right). \tag{4.48}$$

The term multiplying $\beta_1$ in equation (4.48) is $\mathrm{Var}(x_1^*)/\mathrm{Var}(x_1)$, which is always less
than unity under the CEV assumption (4.45). As $\mathrm{Var}(e_1)$ shrinks relative to $\mathrm{Var}(x_1^*)$,
the attenuation bias disappears.

In the case with multiple explanatory variables, equation (4.47) shows that it is not
$\sigma_{x_K^*}^2$ that affects $\mathrm{plim}(\hat{\beta}_K)$ but the variance in $x_K^*$ after netting out the other explanatory
variables. Thus, the more collinear $x_K^*$ is with the other explanatory variables,
the worse is the attenuation bias.
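
The attenuation factor in (4.47), and the way collinearity worsens it, can be illustrated with a two-regressor simulation. The sketch below is a hypothetical design in which the mismeasured regressor `x2_star` plays the role of $x_K^*$; the variances and coefficients are assumptions chosen so that the population limits are easy to compute.

```python
import numpy as np

rng = np.random.default_rng(8)
n = 200_000
beta0, beta1, beta2 = 1.0, 1.0, 2.0     # hypothetical population values
var_r, var_e = 0.36, 0.36               # Var(r*) and Var(e) (assumed)

x1 = rng.normal(size=n)
r_star = rng.normal(0.0, np.sqrt(var_r), n)
x2_star = 0.8 * x1 + r_star             # true regressor, collinear with x1
e = rng.normal(0.0, np.sqrt(var_e), n)  # CEV: independent of x2_star, x1, v
x2 = x2_star + e                        # observed, mismeasured regressor
v = rng.normal(size=n)
y = beta0 + beta1 * x1 + beta2 * x2_star + v

X = np.column_stack([np.ones(n), x1, x2])
b = np.linalg.lstsq(X, y, rcond=None)[0]

factor_multi = var_r / (var_r + var_e)                    # term in (4.47)
var_x2_star = 0.8**2 + var_r
factor_simple = var_x2_star / (var_x2_star + var_e)       # term in (4.48), for comparison

print("coef on x2:", b[2], "predicted", beta2 * factor_multi)
print("coef on x1:", b[1], "(true beta1 =", beta1, ")")
print("attenuation factors: multiple", factor_multi, "vs simple", factor_simple)
```

Because only the part of $x_K^*$ not explained by the other regressors counts in (4.47), the attenuation factor here is 0.5 rather than the 0.74 that the raw variances would suggest. The coefficient on the correctly measured regressor is also inconsistent in this design, since that regressor is correlated with $x_K^*$.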


Example 4.8 (Measurement Error in Family Income): Consider the problem of 
estimating the causal effect of family income on college grade point average, after 
controlling for high school grade point average and SAT score: 


$$colGPA = \beta_0 + \beta_1 faminc^* + \beta_2 hsGPA + \beta_3 SAT + v,$$


where faminc* is actual annual family income. Precise data on colGPA, hsGPA, and 
SAT are relatively easy to obtain from school records. But family income, especially 
as reported by students, could be mismeasured. If $faminc = faminc^* + e_1$, and the
CEV assumptions hold, then using reported family income in place of actual family
income will bias the OLS estimator of $\beta_1$ toward zero. One consequence is that a
hypothesis test of $H_0\colon \beta_1 = 0$ will have a higher probability of Type II error.


If measurement error is present in more than one explanatory variable, deriving 
the inconsistency in the OLS estimators under extensions of the CEV assumptions is 
complicated and does not lead to very usable results. 

In some cases it is clear that the CEV assumption (4.45) cannot be true. For ex- 
ample, suppose that frequency of marijuana usage is to be used as an explanatory 
variable in a wage equation. Let smoked* be the number of days, out of the last 30,
that a worker has smoked marijuana. The variable smoked is the self-reported number
of days. Suppose we postulate the standard measurement error model,
smoked = smoked* + $e_1$, and let us even assume that people try to report the truth. It seems
very likely that people who do not smoke marijuana at all (so that smoked* = 0)
will also report smoked = 0. In other words, the measurement error is zero for people
who never smoke marijuana. When smoked* > 0 it is more likely that someone
miscounts how many days he or she smoked marijuana. Such miscounting almost
certainly means that $e_1$ and smoked* are correlated, a finding that violates the CEV
assumption (4.45).

A general situation where assumption (4.45) is necessarily false occurs when the
observed variable $x_K$ has a smaller population variance than the unobserved variable
$x_K^*$. Of course, we can rarely know with certainty whether this is the case, but we
can sometimes use introspection. For example, consider actual amount of schooling 
versus reported schooling. In many cases, reported schooling will be a rounded-off 
version of actual schooling; therefore, reported schooling is less variable than actual 
schooling. 


Problems 


4.1. Consider a standard log(wage) equation for men under the assumption that all 
explanatory variables are exogenous: 


$$\log(\mathit{wage}) = \beta_0 + \beta_1 \mathit{married} + \beta_2 \mathit{educ} + z\gamma + u, \tag{4.49}$$
$$E(u \mid \mathit{married}, \mathit{educ}, z) = 0,$$

where z contains factors other than marital status and education that can affect
wage. When $\beta_1$ is small, $100 \cdot \beta_1$ is approximately the ceteris paribus percentage
difference in wages between married and unmarried men. When $\beta_1$ is large, it might be
preferable to use the exact percentage difference in $E(\mathit{wage} \mid \mathit{married}, \mathit{educ}, z)$. Call
this $\theta_1$.

a. Show that, if u is independent of all explanatory variables in equation (4.49), then
$\theta_1 = 100 \cdot [\exp(\beta_1) - 1]$. (Hint: Find $E(\mathit{wage} \mid \mathit{married}, \mathit{educ}, z)$ for married = 1 and
married = 0, and find the percentage difference.) A natural, consistent estimator of
$\theta_1$ is $\hat{\theta}_1 = 100 \cdot [\exp(\hat{\beta}_1) - 1]$, where $\hat{\beta}_1$ is the OLS estimator from equation (4.49).

b. Use the delta method (see Section 3.5.2) to show that the asymptotic standard error of
$\hat{\theta}_1$ is $[100 \cdot \exp(\hat{\beta}_1)] \cdot \operatorname{se}(\hat{\beta}_1)$.

c. Repeat parts a and b by finding the exact percentage change in $E(\mathit{wage} \mid \mathit{married},
\mathit{educ}, z)$ for any given change in educ, $\Delta\mathit{educ}$. Call this $\theta_2$. Explain how to estimate
$\theta_2$ and obtain its asymptotic standard error.

d. Use the data in NLS80.RAW to estimate equation (4.49), where z contains the
remaining variables in equation (4.29) (except ability, of course). Find $\hat{\theta}_1$ and its
standard error; find $\hat{\theta}_2$ and its standard error when $\Delta\mathit{educ} = 4$.


4.2. a. Show that, under random sampling and the zero conditional mean
assumption $E(u \mid x) = 0$, $E(\hat{\beta} \mid X) = \beta$ if $X'X$ is nonsingular. (Hint: Use Property CE.5
in the appendix to Chapter 2.)

b. In addition to the assumptions from part a, assume that $\operatorname{Var}(u \mid x) = \sigma^2$. Show
that $\operatorname{Var}(\hat{\beta} \mid X) = \sigma^2 (X'X)^{-1}$.


4.3. Suppose that in the linear model (4.5), $E(x'u) = 0$ (where x contains unity),
$\operatorname{Var}(u \mid x) = \sigma^2$, but $E(u \mid x) \neq E(u)$.

a. Is it true that $E(u^2 \mid x) = \sigma^2$?


b. What relevance does part a have for OLS estimation? 


4.4. Show that the estimator $\hat{B} \equiv N^{-1} \sum_{i=1}^{N} \hat{u}_i^2 x_i' x_i$ is consistent for $B = E(u^2 x'x)$ by
showing that $N^{-1} \sum_{i=1}^{N} \hat{u}_i^2 x_i' x_i = N^{-1} \sum_{i=1}^{N} u_i^2 x_i' x_i + o_p(1)$. (Hint: Write $\hat{u}_i^2 = u_i^2 -
2 x_i u_i (\hat{\beta} - \beta) + [x_i(\hat{\beta} - \beta)]^2$, and use the facts that sample averages are $O_p(1)$ when
expectations exist and that $\hat{\beta} - \beta = o_p(1)$. Assume that all necessary expectations
exist and are finite.)


4.5. Let y and z be random scalars, and let x be a 1 x K random vector, where one 
element of x can be unity to allow for a nonzero intercept. Consider the population 
model 


$$E(y \mid x, z) = x\beta + \gamma z, \tag{4.50}$$
$$\operatorname{Var}(y \mid x, z) = \sigma^2, \tag{4.51}$$

where interest lies in the $K \times 1$ vector $\beta$. To rule out trivialities, assume that $\gamma \neq 0$. In
addition, assume that x and z are orthogonal in the population: $E(x'z) = 0$.
Consider two estimators of $\beta$ based on N independent and identically distributed
observations: (1) $\hat{\beta}$ (obtained along with $\hat{\gamma}$) is from the regression of y on x and z; (2)
$\tilde{\beta}$ is from the regression of y on x. Both estimators are consistent for $\beta$ under
equation (4.50) and $E(x'z) = 0$ (along with the standard rank conditions).

a. Show that, without any additional assumptions (except those needed to apply
the law of large numbers and the central limit theorem), $\operatorname{Avar} \sqrt{N}(\tilde{\beta} - \beta) -
\operatorname{Avar} \sqrt{N}(\hat{\beta} - \beta)$ is always positive semidefinite (and usually positive definite).
Therefore, from the standpoint of asymptotic analysis, it is always better under
equations (4.50) and (4.51) to include variables in a regression model that are
uncorrelated with the variables of interest.

b. Consider the special case where $z = (x_K - \mu_K)^2$, $\mu_K \equiv E(x_K)$, and $x_K$ is
symmetrically distributed: $E[(x_K - \mu_K)^3] = 0$. Then $\beta_K$ is the partial effect of $x_K$ on $E(y \mid x)$
evaluated at $x_K = \mu_K$. Is it better to estimate the average partial effect with or
without $(x_K - \mu_K)^2$ included as a regressor?

c. Under the setup in Problem 2.3, with $\operatorname{Var}(y \mid x) = \sigma^2$, is it better to estimate $\beta_1$
and $\beta_2$ with or without $x_1 x_2$ in the regression?


4.6. Let the variable nonwhite be a binary variable indicating race: nonwhite = 1 if 
the person is a race other than white. Given that race is determined at birth and is 
beyond an individual’s control, explain how nonwhite can be an endogenous explan- 
atory variable in a regression model. In particular, consider the three kinds of endo- 
geneity discussed in Section 4.1. 


4.7. Consider estimating the effect of personal computer ownership, as represented
by a binary variable, PC, on college GPA, colGPA. With data on SAT scores and
high school GPA you postulate the model

$$\mathit{colGPA} = \beta_0 + \beta_1 \mathit{hsGPA} + \beta_2 \mathit{SAT} + \beta_3 \mathit{PC} + u.$$

a. Why might u and PC be positively correlated?

b. If the given equation is estimated by OLS using a random sample of college
students, is $\hat{\beta}_3$ likely to have an upward or downward asymptotic bias?

c. What are some variables that might be good proxies for the unobservables in u 
that are correlated with PC? 


4.8. Consider a population regression with two explanatory variables, but where
they have an interactive effect and $x_2$ appears as a quadratic:

$$E(y \mid x_1, x_2) = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_1 x_2 + \beta_4 x_2^2.$$

Let $\mu_1 \equiv E(x_1)$ and $\mu_2 \equiv E(x_2)$ be the population means of the explanatory variables.

a. Let $\alpha_1$ denote the average partial effect (across the distribution of the explanatory
variables) of $x_1$ on $E(y \mid x_1, x_2)$, and let $\alpha_2$ be the same for $x_2$. Find $\alpha_1$ and $\alpha_2$ in terms
of the $\beta_j$ and $\mu_j$.

b. Rewrite the regression function so that $\alpha_1$ and $\alpha_2$ appear directly. (Note that $\mu_1$
and $\mu_2$ will also appear.)

c. Given a random sample, what regression would you run to estimate $\alpha_1$ and $\alpha_2$
directly? What if you do not know $\mu_1$ and $\mu_2$?

d. Apply part c to the data in NLS80.RAW, where $y = \log(\mathit{wage})$, $x_1 = \mathit{educ}$, and
$x_2 = \mathit{exper}$. (You will have to plug in the sample averages of educ and exper.) Compare
coefficients and standard errors when the interaction term is $\mathit{educ} \cdot \mathit{exper}$ instead,
and discuss.


4.9. Consider a linear model where the dependent variable is in logarithmic form, 
and the lag of log( y) is also an explanatory variable: 


$$\log(y) = \beta_0 + x\beta + \alpha_1 \log(y_{-1}) + u, \qquad E(u \mid x, y_{-1}) = 0,$$

where the inclusion of $\log(y_{-1})$ might be to control for correlation between policy
variables in x and a previous value of y; see Example 4.4.

a. For estimating $\beta$, why do we obtain the same estimator if the growth in y, $\log(y) -
\log(y_{-1})$, is used instead as the dependent variable?

b. Suppose that there are no covariates x in the equation. Show that, if the
distributions of $y$ and $y_{-1}$ are identical, then $|\alpha_1| < 1$. This is the regression-to-the-mean
phenomenon in a dynamic setting. (Hint: Show that $\alpha_1 = \operatorname{Corr}[\log(y), \log(y_{-1})]$.)


4.10. Use Property LP.7 from Chapter 2 (particularly equation (2.56)) and Problem
2.6 to derive equation (4.47). (Hint: First use Problem 2.6 to show that the population
residual $r_K$, in the linear projection of $x_K$ on $1, x_1, \ldots, x_{K-1}$, is $r_K^* + e_K$. Then
find the projection of y on $r_K$ and use Property LP.7.)


4.11. a. In Example 4.3, use KWW and IQ simultaneously as proxies for ability
in equation (4.29). Compare the estimated return to education without a proxy for
ability and with IQ as the only proxy for ability.

b. Test KWW and IQ for joint significance in the estimated equation from part a.

c. When KWW and IQ are used as proxies for abil, does the wage differential
between nonblacks and blacks disappear? What is the estimated differential?

d. Add the interactions $\mathit{educ} \cdot (\mathit{IQ} - 100)$ and $\mathit{educ} \cdot (\mathit{KWW} - \overline{\mathit{KWW}})$ to the regression
from part a, where $\overline{\mathit{KWW}}$ is the average score in the sample. Are these terms jointly
significant using a standard F test? Does adding them affect any important
conclusions?


4.12. Redo Example 4.4, adding the variable union—a dummy variable indicat- 
ing whether the workers at the plant are unionized—as an additional explanatory 
variable. 

4.13. Use the data in CORNWELL.RAW (from Cornwell and Trumball, 1994) to
estimate a model of county-level crime rates, using the year 1987 only. 


a. Using logarithms of all variables, estimate a model relating the crime rate to the 
deterrent variables prbarr, prbconv, prbpris, and avgsen. 


b. Add log(crmrte) for 1986 as an additional explanatory variable, and comment on 
how the estimated elasticities differ from part a. 


c. Compute the F statistic for joint significance of all of the wage variables (again in 
logs), using the restricted model from part b. 


d. Redo part c, but make the test robust to heteroskedasticity of unknown form. 


4.14. Use the data in ATTEND.RAW to answer this question. 

a. To determine the effects of attending lecture on final exam performance, estimate 
a model relating stndfnl (the standardized final exam score) to atndrte (the percent of 
lectures attended). Include the binary variables frosh and soph as explanatory vari- 
ables. Interpret the coefficient on atndrte, and discuss its significance. 

b. How confident are you that the OLS estimates from part a are estimating the
causal effect of attendance? Explain.

c. As proxy variables for student ability, add to the regression priGPA (prior cumu- 
lative GPA) and ACT (achievement test score). Now what is the effect of atndrte? 
Discuss how the effect differs from that in part a. 

d. What happens to the significance of the dummy variables in part c as compared 
with part a? Explain. 

e. Add the squares of priGPA and ACT to the equation. What happens to the co- 
efficient on atndrte? Are the quadratics jointly significant? 

f. To test for a nonlinear effect of atndrte, add its square to the equation from part e. 
What do you conclude? 


4.15. Assume that $y$ and each $x_j$ have finite second moments, and write the linear
projection of $y$ on $(1, x_1, \ldots, x_K)$ as

$$y = \beta_0 + \beta_1 x_1 + \cdots + \beta_K x_K + u = \beta_0 + x\beta + u,$$
$$E(u) = 0, \quad E(x_j u) = 0, \quad j = 1, 2, \ldots, K.$$

a. Show that $\sigma_y^2 = \operatorname{Var}(x\beta) + \sigma_u^2$.

b. For a random draw i from the population, write $y_i = \beta_0 + x_i\beta + u_i$. Evaluate the
following assumption, which has been known to appear in econometrics textbooks:
“$\operatorname{Var}(u_i) = \sigma^2 = \operatorname{Var}(y_i)$ for all $i$.”

c. Define the population R-squared by $\rho^2 \equiv 1 - \sigma_u^2/\sigma_y^2 = \operatorname{Var}(x\beta)/\sigma_y^2$. Show that the
R-squared, $R^2 = 1 - \mathrm{SSR}/\mathrm{SST}$, is a consistent estimator of $\rho^2$, where SSR is the OLS
sum of squared residuals and $\mathrm{SST} \equiv \sum_{i=1}^{N} (y_i - \bar{y})^2$ is the total sum of squares.

d. Evaluate the following statement: “In the presence of heteroskedasticity, the R- 
squared from an OLS regression is meaningless.” (This kind of statement also tends 
to appear in econometrics texts.) 


4.16. Let $\{(x_i, u_i): i = 1, 2, \ldots\}$ be a sequence of independent, not identically
distributed (i.n.i.d.) random vectors following the linear model $y_i = x_i\beta + u_i$, $i =
1, 2, \ldots$. Assume that $E(x_i'u_i) = 0$ for all $i$, that $N^{-1} \sum_{i=1}^{N} E(x_i'x_i) \to A$, where $A$ is a
$K \times K$ positive definite matrix, and that $\{x_i'x_i\}$ satisfies the law of large numbers:
$N^{-1} \sum_{i=1}^{N} [x_i'x_i - E(x_i'x_i)] \overset{p}{\to} 0$.

a. If $N^{-1} \sum_{i=1}^{N} x_i'u_i$ also satisfies the law of large numbers, prove that the OLS
estimator, $\hat{\beta}$, from the regression $y_i$ on $x_i$, $i = 1, 2, \ldots, N$, is consistent for $\beta$.

b. Define $B_N = N^{-1} \sum_{i=1}^{N} E(u_i^2 x_i'x_i)$ for each $N$ and assume that $B_N \to B$ as
$N \to \infty$. Further, assume that $N^{-1/2} \sum_{i=1}^{N} x_i'u_i \overset{d}{\to} \operatorname{Normal}(0, B)$. Show that
$\sqrt{N}(\hat{\beta} - \beta)$ is asymptotically normal and find its asymptotic variance matrix.

c. Suggest consistent estimators of A and B; you need not prove consistency. 


d. Comment on how the estimators in part c compare with the i.i.d. case. 

4.17. Consider the standard linear model $y = x\beta + u$ under Assumptions OLS.1
and OLS.2. Define $h(x) \equiv E(u^2 \mid x)$. Let $\hat{\beta}$ be the OLS estimator, and show that we
can always write

$$\operatorname{Avar} \sqrt{N}(\hat{\beta} - \beta) = [E(x'x)]^{-1} E[h(x) x'x] [E(x'x)]^{-1}.$$

This expression is useful when $E(u \mid x) = 0$ for comparing the asymptotic variances of
OLS and weighted least squares estimators; see, for example, Wooldridge (1994b).


4.18. Describe what is wrong with each of the following two statements:


a. “The central limit theorem implies that, as the sample size grows, the error dis- 
tribution approaches normality.” 


b. “The heteroskedasticity-robust standard errors are consistent because $\hat{u}_i^2$ (the
squared OLS residual) is a consistent estimator of $E(u_i^2 \mid x_i)$ for each $i$.”


5 Instrumental Variables Estimation of Single-Equation Linear Models 


In this chapter we treat instrumental variables estimation, which is probably second 
only to ordinary least squares in terms of methods used in empirical economic re- 
search. The underlying population model is the same as in Chapter 4, but we explic- 
itly allow the unobservable error to be correlated with the explanatory variables. 


5.1 Instrumental Variables and Two-Stage Least Squares 


5.1.1 Motivation for Instrumental Variables Estimation 


To motivate the need for the method of instrumental variables, consider a linear 
population model 


$$y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_K x_K + u, \tag{5.1}$$
$$E(u) = 0, \quad \operatorname{Cov}(x_j, u) = 0, \quad j = 1, 2, \ldots, K - 1, \tag{5.2}$$


but where $x_K$ might be correlated with $u$. In other words, the explanatory variables
$x_1, x_2, \ldots, x_{K-1}$ are exogenous, but $x_K$ is potentially endogenous in equation (5.1).
The endogeneity can come from any of the sources we discussed in Chapter 4. To fix
ideas, it might help to think of $u$ as containing an omitted variable that is uncorrelated
with all explanatory variables except $x_K$. So, we may be interested in a conditional
expectation as in equation (4.18), but we do not observe $q$, and $q$ is correlated
with $x_K$.

As we saw in Chapter 4, OLS estimation of equation (5.1) generally results in
inconsistent estimators of all the $\beta_j$ if $\operatorname{Cov}(x_K, u) \neq 0$. Further, without more
information, we cannot consistently estimate any of the parameters in equation (5.1).

The method of instrumental variables (IV) provides a general solution to the
problem of an endogenous explanatory variable. To use the IV approach with $x_K$
endogenous, we need an observable variable, $z_1$, not in equation (5.1) that satisfies
two conditions. First, $z_1$ must be uncorrelated with $u$:

$$\operatorname{Cov}(z_1, u) = 0. \tag{5.3}$$

In other words, like $x_1, \ldots, x_{K-1}$, $z_1$ is exogenous in equation (5.1).

The second requirement involves the relationship between $z_1$ and the endogenous
variable, $x_K$. A precise statement requires the linear projection of $x_K$ onto all the
exogenous variables:

$$x_K = \delta_0 + \delta_1 x_1 + \delta_2 x_2 + \cdots + \delta_{K-1} x_{K-1} + \theta_1 z_1 + r_K, \tag{5.4}$$


where, by definition of a linear projection error, $E(r_K) = 0$ and $r_K$ is uncorrelated
with $x_1, x_2, \ldots, x_{K-1}$, and $z_1$. The key assumption on this linear projection is that the
coefficient on $z_1$ is nonzero:

$$\theta_1 \neq 0. \tag{5.5}$$

This condition is often loosely described as “$z_1$ is correlated with $x_K$,” but that
statement is not quite correct. The condition $\theta_1 \neq 0$ means that $z_1$ is partially
correlated with $x_K$ once the other exogenous variables $x_1, \ldots, x_{K-1}$ have been netted out.
If $x_K$ is the only explanatory variable in equation (5.1), then the linear projection is
$x_K = \delta_0 + \theta_1 z_1 + r_K$, where $\theta_1 = \operatorname{Cov}(z_1, x_K)/\operatorname{Var}(z_1)$, and so condition (5.5) and
$\operatorname{Cov}(z_1, x_K) \neq 0$ are the same.
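
The distinction between simple and partial correlation in condition (5.5) can be illustrated with simulated data. The sketch below is an illustration under assumed values, not an example from the text: the data-generating process is hypothetical and is built so that the candidate instrument is correlated with the endogenous variable only through the included exogenous variable.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 100_000

x1 = rng.normal(size=n)                 # included exogenous variable
z1 = x1 + rng.normal(size=n)            # candidate instrument, related to x1 only
xK = 0.5 + x1 + rng.normal(size=n)      # endogenous variable: depends on x1, not on z1

# Simple correlation between z1 and xK is clearly nonzero ...
simple_corr = np.corrcoef(z1, xK)[0, 1]

# ... but the coefficient on z1 in the projection of xK on (1, x1, z1) is about zero.
W = np.column_stack([np.ones(n), x1, z1])
coef, *_ = np.linalg.lstsq(W, xK, rcond=None)
theta1_hat = coef[2]

print(f"Corr(z1, xK) = {simple_corr:.3f}, theta1_hat = {theta1_hat:.4f}")
```

In this assumed design the simple correlation is about 0.5, yet the partial coefficient is essentially zero, so the candidate would fail condition (5.5) despite being "correlated with" the endogenous variable.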

At this point we should mention that we have put no restrictions on the distribution
of $x_K$ or $z_1$. In many cases $x_K$ and $z_1$ will be both essentially continuous, but
sometimes $x_K$, $z_1$, or both are discrete. In fact, one or both of $x_K$ and $z_1$ can be binary
variables, or have continuous and discrete characteristics at the same time. Equation
(5.4) is simply a linear projection, and this is always defined when second moments of
all variables are finite.

When $z_1$ satisfies conditions (5.3) and (5.5), then it is said to be an instrumental
variable (IV) candidate for $x_K$. (Sometimes $z_1$ is simply called an instrument for $x_K$.)
Because $x_1, \ldots, x_{K-1}$ are already uncorrelated with $u$, they serve as their own
instrumental variables in equation (5.1). In other words, the full list of instrumental
variables is the same as the list of exogenous variables, but we often just refer to the
instrument for the endogenous explanatory variable.

The linear projection in equation (5.4) is called a reduced-form equation for the
endogenous explanatory variable $x_K$. In the context of single-equation linear models,
a reduced form always involves writing an endogenous variable as a linear projection 
onto all exogenous variables. The “reduced form” terminology comes from simulta- 
neous equations analysis, and it makes more sense in that context. We use it in all IV 
contexts because it is a concise way of stating that an endogenous variable has been 
linearly projected onto the exogenous variables. The terminology also conveys that 
there is nothing necessarily structural about equation (5.4). 

From the structural equation (5.1) and the reduced form for $x_K$, we obtain a
reduced form for $y$ by plugging equation (5.4) into equation (5.1) and rearranging:

$$y = \alpha_0 + \alpha_1 x_1 + \cdots + \alpha_{K-1} x_{K-1} + \lambda_1 z_1 + v, \tag{5.6}$$

where $v \equiv u + \beta_K r_K$ is the reduced-form error, $\alpha_j = \beta_j + \beta_K \delta_j$, and $\lambda_1 = \beta_K \theta_1$. By our
assumptions, $v$ is uncorrelated with all explanatory variables in equation (5.6), and so
OLS consistently estimates the reduced-form parameters, the $\alpha_j$ and $\lambda_1$.
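
The substitution that produces equation (5.6) can be written out term by term. The following sketch uses only equations (5.1) and (5.4); the intercept expression $\alpha_0 = \beta_0 + \beta_K \delta_0$ is not stated in the text and is included here because it follows from the same algebra.

$$
\begin{aligned}
y &= \beta_0 + \sum_{j=1}^{K-1} \beta_j x_j + \beta_K \Big( \delta_0 + \sum_{j=1}^{K-1} \delta_j x_j + \theta_1 z_1 + r_K \Big) + u \\
  &= (\beta_0 + \beta_K \delta_0) + \sum_{j=1}^{K-1} (\beta_j + \beta_K \delta_j) x_j + \beta_K \theta_1 z_1 + (u + \beta_K r_K).
\end{aligned}
$$

Matching terms with (5.6) gives the coefficients on the exogenous variables, the coefficient $\beta_K \theta_1$ on $z_1$, and the reduced-form error $u + \beta_K r_K$. One consequence, assuming $\theta_1 \neq 0$, is that $\beta_K$ equals the ratio of the reduced-form coefficient on $z_1$ in the $y$ equation to $\theta_1$.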

Estimates of the reduced-form parameters are sometimes of interest in their own
right, but estimating the structural parameters is generally more useful. For example,
at the firm level, suppose that $x_K$ is job training hours per worker and $y$ is a measure
of average worker productivity. Suppose that job training grants were randomly
assigned to firms. Then it is natural to use for $z_1$ either a binary variable indicating
whether a firm received a job training grant or the actual amount of the grant per
worker (if the amount varies by firm). The parameter $\beta_K$ in equation (5.1) is the effect
of job training on worker productivity. If $z_1$ is a binary variable for receiving a
job training grant, then $\lambda_1$ is the effect of receiving this particular job training grant
on worker productivity, which is of some interest. But estimating the effect of an
hour of general job training is more valuable because it can be meaningful in many
situations.

We can now show that the assumptions we have made on the IV $z_1$ solve the
identification problem for the $\beta_j$ in equation (5.1). By identification we mean that we
can write the $\beta_j$ in terms of population moments in observable variables. To see how,
write equation (5.1) as

$$y = x\beta + u, \tag{5.7}$$

where the constant is absorbed into $x$ so that $x = (1, x_2, \ldots, x_K)$. Write the $1 \times K$
vector of all exogenous variables as

$$z \equiv (1, x_2, \ldots, x_{K-1}, z_1).$$

Assumptions (5.2) and (5.3) imply the $K$ population orthogonality conditions

$$E(z'u) = 0. \tag{5.8}$$

Multiplying equation (5.7) through by $z'$, taking expectations, and using equation
(5.8) gives

$$[E(z'x)]\beta = E(z'y), \tag{5.9}$$

where $E(z'x)$ is $K \times K$ and $E(z'y)$ is $K \times 1$. Equation (5.9) represents a system of $K$
linear equations in the $K$ unknowns $\beta_1, \beta_2, \ldots, \beta_K$. This system has a unique solution
if and only if the $K \times K$ matrix $E(z'x)$ has full rank; that is,

$$\operatorname{rank} E(z'x) = K, \tag{5.10}$$

in which case the solution is

$$\beta = [E(z'x)]^{-1} E(z'y). \tag{5.11}$$

The expectations $E(z'x)$ and $E(z'y)$ can be consistently estimated using a random
sample on $(x, y, z_1)$, and so equation (5.11) identifies the vector $\beta$.

It is clear that condition (5.3) was used to obtain equation (5.11). But where have
we used condition (5.5)? Let us maintain that there are no linear dependencies among
the exogenous variables, so that $E(z'z)$ has full rank $K$; this simply rules out perfect
collinearity in $z$ in the population. Then, it can be shown that equation (5.10) holds if
and only if $\theta_1 \neq 0$. (A more general case, which we treat in Section 5.1.2, is the subject
of Problem 5.12.) Therefore, along with the exogeneity condition (5.3), assumption
(5.5) is the key identification condition. Assumption (5.10) is the rank condition for
identification, and we return to it more generally in Section 5.2.1.

Given a random sample $\{(x_i, y_i, z_{i1}): i = 1, 2, \ldots, N\}$ from the population, the
instrumental variables estimator of $\beta$ is

$$\hat{\beta} = \left( N^{-1} \sum_{i=1}^{N} z_i' x_i \right)^{-1} \left( N^{-1} \sum_{i=1}^{N} z_i' y_i \right) = (Z'X)^{-1} Z'Y,$$

where $Z$ and $X$ are $N \times K$ data matrices and $Y$ is the $N \times 1$ data vector on the $y_i$.
The consistency of this estimator is immediate from equation (5.11) and the law of
large numbers. We consider a more general case in Section 5.2.1.
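
The matrix formula for the IV estimator translates directly into a few lines of linear algebra. The following is a minimal sketch on simulated data, not a production implementation: the function name `iv_estimator`, the data-generating process, and all coefficient values are assumptions made for illustration, with an omitted variable `q` creating the endogeneity.

```python
import numpy as np

def iv_estimator(Z, X, y):
    """Just-identified IV estimator (Z'X)^{-1} Z'y; Z and X are both N x K."""
    return np.linalg.solve(Z.T @ X, Z.T @ y)

rng = np.random.default_rng(2)
n = 200_000

q = rng.normal(size=n)                               # omitted variable, ends up in u
x1 = rng.normal(size=n)                              # exogenous regressor
z1 = rng.normal(size=n)                              # instrument: unrelated to q
xK = 0.5 * x1 + 0.8 * z1 + q + rng.normal(size=n)    # endogenous because of q
u = q + rng.normal(size=n)                           # structural error

beta = np.array([1.0, -0.5, 2.0])                    # assumed true (intercept, x1, xK)
X = np.column_stack([np.ones(n), x1, xK])
Z = np.column_stack([np.ones(n), x1, z1])            # exogenous variables plus instrument
y = X @ beta + u

beta_iv = iv_estimator(Z, X, y)
beta_ols = np.linalg.lstsq(X, y, rcond=None)[0]
print("IV :", np.round(beta_iv, 3))
print("OLS:", np.round(beta_ols, 3))
```

In this assumed design the OLS coefficient on the endogenous regressor is pushed upward by the omitted variable, while the IV estimate should be close to the assumed value of 2. Using `np.linalg.solve` rather than forming the inverse explicitly is a numerical convenience and does not change the estimator.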

When searching for instruments for an endogenous explanatory variable,
conditions (5.3) and (5.5) are equally important in identifying $\beta$. There is, however, one
practically important difference between them: condition (5.5) can be tested, whereas
condition (5.3) must be maintained. The reason for this disparity is simple: the
covariance in condition (5.3) involves the unobservable $u$, and therefore we cannot
test anything about $\operatorname{Cov}(z_1, u)$.

Testing condition (5.5) in the reduced form (5.4) is a simple matter of computing a
t test after OLS estimation. Nothing guarantees that $r_K$ satisfies the requisite
homoskedasticity assumption (Assumption OLS.3), so a heteroskedasticity-robust t
statistic for $\hat{\theta}_1$ is often warranted. This statement is especially true if $x_K$ is a binary variable
or some other variable with discrete characteristics.
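
A heteroskedasticity-robust t statistic for the instrument in the reduced form can be computed by hand. The sketch below is illustrative only: the helper name `robust_t_stats` is hypothetical, the variance estimator is the basic White form without a degrees-of-freedom adjustment (an assumption; software packages often apply a correction), and the binary endogenous variable is simulated.

```python
import numpy as np

def robust_t_stats(W, x):
    """OLS of x on W, returning coefficients and White-type robust t statistics."""
    coef, *_ = np.linalg.lstsq(W, x, rcond=None)
    resid = x - W @ coef
    WtW_inv = np.linalg.inv(W.T @ W)
    meat = (W * resid[:, None] ** 2).T @ W       # sum over i of r_i^2 * w_i' w_i
    avar = WtW_inv @ meat @ WtW_inv              # sandwich variance estimate
    return coef, coef / np.sqrt(np.diag(avar))

rng = np.random.default_rng(3)
n = 5_000
x1 = rng.normal(size=n)
z1 = rng.normal(size=n)
# Binary endogenous variable, so the projection error is heteroskedastic by construction
xK = (0.3 * x1 + 0.5 * z1 + rng.normal(size=n) > 0).astype(float)

W = np.column_stack([np.ones(n), x1, z1])        # (1, x1, z1) as in the reduced form
coef, t_stats = robust_t_stats(W, xK)
print(f"theta1_hat = {coef[2]:.3f}, robust t = {t_stats[2]:.1f}")
```

A large robust t statistic on the instrument is evidence in favor of condition (5.5) in the sample at hand; it says nothing about condition (5.3).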

A word of caution is in order here. Econometricians have been known to say that 
“it is not possible to test for identification.” In the model with one endogenous vari- 
able and one instrument, we have just seen the sense in which this statement is true: 
assumption (5.3) cannot be tested. Nevertheless, the fact remains that condition (5.5) 
can and should be tested. In fact, recent work has shown that the strength of the re- 
jection in condition (5.5) (in a p-value sense) is important for determining the finite 
sample properties, particularly the bias, of the IV estimator. We return to this issue in 
Section 5.2.6. 

In the context of omitted variables, an instrumental variable, like a proxy variable,
must be redundant in the structural model (that is, the model that explicitly contains
the unobservables; see condition (4.25)). However, unlike a proxy variable, an IV for
$x_K$ should be uncorrelated with the omitted variable. Remember, we want a proxy
variable to be highly correlated with the omitted variable. A proxy variable makes a
poor IV and, as we saw in Section 4.3.3, an IV makes a poor proxy variable.


Example 5.1 (Instrumental Variables for Education in a Wage Equation): Consider 
a wage equation for the U.S. working population: 


$$\log(\mathit{wage}) = \beta_0 + \beta_1 \mathit{exper} + \beta_2 \mathit{exper}^2 + \beta_3 \mathit{educ} + u, \tag{5.12}$$


where $u$ is thought to be correlated with educ because of omitted ability, as well as
other factors, such as quality of education and family background. Suppose that we
can collect data on mother’s education, motheduc. For this to be a valid instrument
for educ we must assume that motheduc is uncorrelated with $u$ and that $\theta_1 \neq 0$ in the
reduced-form equation

$$\mathit{educ} = \delta_0 + \delta_1 \mathit{exper} + \delta_2 \mathit{exper}^2 + \theta_1 \mathit{motheduc} + r.$$


There is little doubt that educ and motheduc are partially correlated, and this asser- 
tion is easily tested given a random sample from the population. The potential 
problem with motheduc as an instrument for educ is that motheduc might be corre- 
lated with the omitted factors in u: mother’s education is likely to be correlated with 
child’s ability and other family background characteristics that might be in u. 

A variable such as the last digit of one’s social security number makes a poor IV 
candidate for the opposite reason. Because the last digit is randomly determined, it is 
independent of other factors that affect earnings. But it is also independent of edu- 
cation. Therefore, while condition (5.3) holds, condition (5.5) does not. 

By being clever it is often possible to come up with more convincing instruments— 
at least at first glance. Angrist and Krueger (1991) propose using quarter of birth as 
an IV for education. In the simplest case, let frstqrt be a dummy variable equal to
unity for people born in the first quarter of the year and zero otherwise. Quarter of
birth is arguably independent of unobserved factors such as ability that affect wage
(although there is disagreement on this point; see Bound, Jaeger, and Baker (1995)).
In addition, we must have $\theta_1 \neq 0$ in the reduced form

$$\mathit{educ} = \delta_0 + \delta_1 \mathit{exper} + \delta_2 \mathit{exper}^2 + \theta_1 \mathit{frstqrt} + r.$$

How can quarter of birth be (partially) correlated with educational attainment?
Angrist and Krueger (1991) argue that compulsory school attendance laws induce a
relationship between educ and frstqrt: at least some people are forced, by law, to
attend school longer than they otherwise would, and this fact is correlated with quarter
of birth. We can determine the strength of this association in a particular sample by
estimating the reduced form and obtaining the t statistic for $H_0: \theta_1 = 0$.

This example illustrates that it can be very difficult to find a good instrumental
variable for an endogenous explanatory variable because the variable must satisfy
two different, often conflicting, criteria. For motheduc, the issue in doubt is whether
condition (5.3) holds. For frstqrt, the initial concern is with condition (5.5). Since
condition (5.5) can be tested, frstqrt has more appeal as an instrument. However, the
partial correlation between educ and frstqrt is small, and this can lead to finite sample
problems (see Section 5.2.6). A more subtle issue concerns the sense in which we are
estimating the return to education for the entire population of working people. As we
will see in Chapter 18, if the return to education is not constant across people, the IV
estimator that uses frstqrt as an IV estimates the return to education only for those
people induced to obtain more schooling because they were born in the first quarter
of the year. These make up a relatively small fraction of the population.

Convincing instruments sometimes arise in the context of program evaluation, 
where individuals are randomly selected to be eligible for the program. Examples 
include job training programs and school voucher programs. Actual participation is 
almost always voluntary, and it may be endogenous because it can depend on unob- 
served factors that affect the response. However, it is often reasonable to assume that 
eligibility is exogenous. Because participation and eligibility are correlated, the latter 
can be used as an IV for the former. 

A valid instrumental variable can also come from what is called a natural experi- 
ment. A natural experiment occurs when some (often unintended) feature of the setup 
we are studying produces exogenous variation in an otherwise endogenous explana- 
tory variable. The Angrist and Krueger (1991) example seems, at least initially, to be 
a good natural experiment. Another example is given by Angrist (1990), who studies 
the effect of serving in the Vietnam war on the earnings of men. Participation in the 
military is not necessarily exogenous to unobserved factors that affect earnings, even 
after controlling for education, nonmilitary experience, and so on. Angrist used the 
following observation to obtain an instrumental variable for the binary Vietnam war 
participation indicator: men with a lower draft lottery number were more likely to 
serve in the war. Angrist verifies that the probability of serving in Vietnam is indeed 
related to draft lottery number. Because the lottery number is randomly determined, 
it seems like an ideal IV for serving in Vietnam. There are, however, some potential 
problems. It might be that men who were assigned a low lottery number chose to 
obtain more education as a way of increasing the chance of obtaining a draft defer- 
ment. If we do not control for education in the earnings equation, lottery number 
could be endogenous. Further, employers may have been willing to invest in job 
training for men who are unlikely to be drafted. Again, unless we can include mea- 
sures of job training in the earnings equation, condition (5.3) may be violated. (This 
reasoning assumes that we are interested in estimating the pure effect of serving in
Vietnam, as opposed to including indirect effects such as reduced job training.) 

Hoxby (2000) uses topographical features, in particular the natural boundaries 
created by rivers, as IVs for the concentration of public schools within a school dis- 
trict. She uses these IVs to estimate the effects of competition among public schools 
on student performance. Cutler and Glaeser (1997) use the Hoxby instruments, as 
well as others, to estimate the effects of segregation on schooling and employment 
outcomes for blacks. Levitt (1997) provides another example of obtaining instrumen- 
tal variables from a natural experiment. He uses the timing of mayoral and guber- 
natorial elections as instruments for size of the police force in estimating the effects of 
police on city crime rates. (Levitt actually uses panel data, something we will discuss 
in Chapter 11.) 

Sensible IVs need not come from natural experiments. For example, Evans and 
Schwab (1995) study the effect of attending a Catholic high school on various out- 
comes. They use a binary variable for whether a student is Catholic as an IV for 
attending a Catholic high school, and they spend much effort arguing that religion is 
exogenous in their versions of equation (5.7). (In this application, condition (5.5) is 
easy to verify.) Economists often use regional variation in prices or taxes as instru- 
ments for endogenous explanatory variables appearing in individual-level equations. 
For example, in estimating the effects of alcohol consumption on performance in 
college, the local price of alcohol can be used as an IV for alcohol consumption, 
provided other regional factors that affect college performance have been appropri- 
ately controlled for. The idea is that the price of alcohol, including any taxes, can be 
assumed to be exogenous to each individual. 


Example 5.2 (College Proximity as an IV for Education): Using wage data for 
1976, Card (1995) uses a dummy variable that indicates whether a man grew up in 
the vicinity of a four-year college as an instrumental variable for years of schooling. 
He also includes several other controls. In the equation with experience and its 
square, a black indicator, southern and urban indicators, and regional and urban 
indicators for 1966, the instrumental variables estimate of the return to schooling is 
.132, or 13.2 percent, while the OLS estimate is 7.5 percent. Thus, for this sample of 
data, the IV estimate is almost twice as large as the OLS estimate. This result would 
be counterintuitive if we thought that an OLS analysis suffered from an upward 
omitted variable bias. One interpretation is that the OLS estimators suffer from the 
attenuation bias as a result of measurement error, as we discussed in Section 4.4.2. 
But the classical errors-in-variables assumption for education is questionable. Another 
interpretation is that the instrumental variable is not exogenous in the wage equation: 
location is not entirely exogenous. The full set of estimates, including standard errors 
and t statistics, can be found in Card (1995). Or, you can replicate Card’s results in 
Problem 5.4. 


5.1.2 Multiple Instruments: Two-Stage Least Squares 


Consider again the model (5.1) and (5.2), where $x_K$ can be correlated with $u$. Now, 
however, assume that we have more than one instrumental variable for $x_K$. Let 
$z_1, z_2, \ldots, z_M$ be variables such that 


$\mathrm{Cov}(z_h, u) = 0, \quad h = 1, 2, \ldots, M$, (5.13) 


so that each $z_h$ is exogenous in equation (5.1). If each of these has some partial 
correlation with $x_K$, we could have M different IV estimators. Actually, there are many 
more than this (more than we can count), since any linear combination of 
$x_1, x_2, \ldots, x_{K-1}, z_1, z_2, \ldots, z_M$ is uncorrelated with $u$. So which IV estimator 
should we use? 

In Section 5.2.3 we show that, under certain assumptions, the two-stage least 
squares (2SLS) estimator is the most efficient IV estimator. For now, we rely on 
intuition. 

To illustrate the method of 2SLS, define the vector of exogenous variables again by 
$z \equiv (1, x_1, x_2, \ldots, x_{K-1}, z_1, \ldots, z_M)$, a $1 \times L$ vector ($L = K + M$). Out of all 
possible linear combinations of z that can be used as an instrument for $x_K$, the method 
of 2SLS chooses that which is most highly correlated with $x_K$. If $x_K$ were exogenous, 
then this choice would imply that the best instrument for $x_K$ is simply itself. Ruling 
this case out, the linear combination of z most highly correlated with $x_K$ is given by 
the linear projection of $x_K$ on z. Write the reduced form for $x_K$ as 


$x_K = \delta_0 + \delta_1 x_1 + \cdots + \delta_{K-1} x_{K-1} + \theta_1 z_1 + \cdots + \theta_M z_M + r_K$, (5.14) 


where, by definition, $r_K$ has zero mean and is uncorrelated with each right-hand-side 
variable. As any linear combination of z is uncorrelated with u, 


$x_K^* \equiv \delta_0 + \delta_1 x_1 + \cdots + \delta_{K-1} x_{K-1} + \theta_1 z_1 + \cdots + \theta_M z_M$ (5.15) 


is uncorrelated with u. In fact, $x_K^*$ is often interpreted as the part of $x_K$ that is 
uncorrelated with u. If $x_K$ is endogenous, it is because $r_K$ is correlated with u. 

If we could observe $x_K^*$, we would use it as an instrument for $x_K$ in equation (5.1) 
and use the IV estimator from the previous subsection. Since the $\delta_j$ and $\theta_j$ are 
population parameters, $x_K^*$ is not a usable instrument. However, as long as we make the 
standard assumption that there are no exact linear dependencies among the exogenous 
variables, we can consistently estimate the parameters in equation (5.14) by 
OLS. The sample analogues of the $x_{iK}^*$ for each observation i are simply the OLS 
fitted values: 


$\hat{x}_{iK} = \hat{\delta}_0 + \hat{\delta}_1 x_{i1} + \cdots + \hat{\delta}_{K-1} x_{i,K-1} + \hat{\theta}_1 z_{i1} + \cdots + \hat{\theta}_M z_{iM}$. (5.16) 


Now, for each observation i, define the vector $\hat{x}_i \equiv (1, x_{i1}, \ldots, x_{i,K-1}, \hat{x}_{iK})$, 
$i = 1, 2, \ldots, N$. Using $\hat{x}_i$ as the instruments for $x_i$ gives the IV estimator 


$\hat{\beta} = \left( \sum_{i=1}^N \hat{x}_i' x_i \right)^{-1} \left( \sum_{i=1}^N \hat{x}_i' y_i \right) = (\hat{X}'X)^{-1} \hat{X}'Y$, (5.17) 


where unity is also the first element of $x_i$. 

The IV estimator in equation (5.17) turns out to be an OLS estimator. To see this 
fact, note that the $N \times (K + 1)$ matrix $\hat{X}$ can be expressed as $\hat{X} = Z(Z'Z)^{-1}Z'X = P_Z X$, 
where the projection matrix $P_Z = Z(Z'Z)^{-1}Z'$ is idempotent and symmetric. 
Therefore, $\hat{X}'X = X'P_Z X = (P_Z X)'P_Z X = \hat{X}'\hat{X}$. Plugging this expression into 
equation (5.17) shows that the IV estimator that uses instruments $\hat{x}_i$ can be written as 
$\hat{\beta} = (\hat{X}'\hat{X})^{-1}\hat{X}'Y$. The name “two-stage least squares” comes from this procedure. 

To summarize, $\hat{\beta}$ can be obtained from the following steps: 


1. Obtain the fitted values $\hat{x}_K$ from the regression 
$x_K$ on $1, x_1, \ldots, x_{K-1}, z_1, \ldots, z_M$, (5.18) 


where the i subscript is omitted for simplicity. This is called the first-stage regression. 


2. Run the OLS regression 
$y$ on $1, x_1, \ldots, x_{K-1}, \hat{x}_K$. (5.19) 
This is called the second-stage regression, and it produces the $\hat{\beta}_j$. 
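
The two steps can be sketched numerically. The following is a minimal illustration on simulated data, not part of the original text: the data-generating process, the seed, and all variable names are assumptions chosen only to show that the explicit two-step procedure reproduces the IV formula (5.17).

```python
import numpy as np

rng = np.random.default_rng(0)
N = 500
x1 = rng.normal(size=N)                           # exogenous regressor
z1, z2 = rng.normal(size=N), rng.normal(size=N)   # excluded instruments (M = 2)
u = rng.normal(size=N)                            # structural error
# Hypothetical endogenous regressor: correlated with u by construction
xK = 0.5 * x1 + z1 + 0.5 * z2 + 0.8 * u + rng.normal(size=N)
y = 1.0 + 0.5 * x1 + 2.0 * xK + u

X = np.column_stack([np.ones(N), x1, xK])         # regressors, unity first
Z = np.column_stack([np.ones(N), x1, z1, z2])     # all exogenous variables

def ols(A, b):
    """OLS coefficients from regressing b on the columns of A."""
    return np.linalg.lstsq(A, b, rcond=None)[0]

# Step 1: first-stage regression (5.18); keep the fitted values
xK_hat = Z @ ols(Z, xK)
X_hat = np.column_stack([np.ones(N), x1, xK_hat])

# Step 2: second-stage regression (5.19)
beta_two_step = ols(X_hat, y)

# IV formula (5.17), using x_hat_i as instruments for x_i
beta_iv = np.linalg.solve(X_hat.T @ X, X_hat.T @ y)
```

The two coefficient vectors agree up to rounding error. Only the point estimates coincide; the standard errors printed by the second-stage OLS regression are not the correct 2SLS standard errors.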


In practice, it is best to use a software package with a 2SLS command rather than 
explicitly carry out the two-step procedure. Carrying out the two-step procedure 
explicitly makes one susceptible to harmful mistakes. For example, the following, 
seemingly sensible, two-step procedure is generally inconsistent: (1) regress $x_K$ on 
$1, z_1, \ldots, z_M$ and obtain the fitted values, say $\tilde{x}_K$; (2) run the regression in (5.19) with 
$\tilde{x}_K$ in place of $\hat{x}_K$. Problem 5.11 asks you to show that omitting $x_1, \ldots, x_{K-1}$ in the 
first-stage regression and then explicitly doing the second-stage regression produces 
inconsistent estimators of the $\beta_j$. 

Another reason to avoid the two-step procedure is that the OLS standard errors 
reported with regression (5.19) will be incorrect, something that will become clear 
later. Sometimes for hypothesis testing we need to carry out the second-stage regres- 
sion explicitly (see Section 5.2.4). 

The 2SLS estimator and the IV estimator from Section 5.1.1 are identical when 
there is only one instrument for $x_K$. Unless stated otherwise, we mean 2SLS whenever 
we talk about IV estimation of a single equation. 

What is the analogue of the condition (5.5) when more than one instrument is 
available with one endogenous explanatory variable? Problem 5.12 asks you to show 
that E(z’x) has full column rank if and only if at least one of the $\theta_j$ in equation (5.14) 
is nonzero. The intuition behind this requirement is pretty clear: we need at least one 
exogenous variable that does not appear in equation (5.1) to induce variation in $x_K$ 
that cannot be explained by $x_1, \ldots, x_{K-1}$. Identification of $\beta$ does not depend on the 
values of the $\delta_h$ in equation (5.14). 

Testing the rank condition with a single endogenous explanatory variable and multi- 
ple instruments is straightforward. In equation (5.14) we simply test the null hypothesis 


$H_0: \theta_1 = 0, \theta_2 = 0, \ldots, \theta_M = 0$ (5.20) 


against the alternative that at least one of the $\theta_j$ is different from zero. This test gives 
a compelling reason for explicitly running the first-stage regression. If $r_K$ in equation 
(5.14) satisfies the OLS homoskedasticity assumption OLS.3, a standard F statistic or 
Lagrange multiplier statistic can be used to test hypothesis (5.20). Often a 
heteroskedasticity-robust statistic is more appropriate, especially if $x_K$ has discrete 
characteristics. If we cannot reject hypothesis (5.20) against the alternative that at least one 
$\theta_j$ is different from zero, at a reasonably small significance level, then we should have 
serious reservations about the proposed 2SLS procedure: the instruments do not pass 
a minimal requirement. 
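
As an illustration of the test of (5.20), here is a minimal sketch of the homoskedasticity-based F statistic computed from restricted and unrestricted first-stage regressions. The function name `first_stage_F`, the simulated data, and the coefficient values are assumptions made for the example; a heteroskedasticity-robust version would need a different statistic.

```python
import numpy as np

def first_stage_F(xK, X_exog, Z_excl):
    """F statistic for H0: theta_1 = ... = theta_M = 0 in the reduced form (5.14).

    X_exog: N x (K-1) included exogenous regressors (no constant column).
    Z_excl: N x M excluded instruments.
    Justified as an F test only under homoskedasticity of the reduced-form error.
    """
    N, M = Z_excl.shape
    W_r = np.column_stack([np.ones(N), X_exog])   # restricted: instruments dropped
    W_ur = np.column_stack([W_r, Z_excl])         # unrestricted first stage

    def ssr(W):
        resid = xK - W @ np.linalg.lstsq(W, xK, rcond=None)[0]
        return resid @ resid

    df = N - W_ur.shape[1]
    F = ((ssr(W_r) - ssr(W_ur)) / M) / (ssr(W_ur) / df)
    return F, M, df

rng = np.random.default_rng(1)
N = 500
x1 = rng.normal(size=(N, 1))
Zs = rng.normal(size=(N, 2))
# Hypothetical cases: instruments that matter, and instruments that do not
x_strong = 0.5 * x1[:, 0] + Zs[:, 0] + 0.5 * Zs[:, 1] + rng.normal(size=N)
x_irrelevant = 0.5 * x1[:, 0] + rng.normal(size=N)

F_strong, M, df = first_stage_F(x_strong, x1, Zs)
F_irrelevant, _, _ = first_stage_F(x_irrelevant, x1, Zs)
```

In the first case the statistic is very large, so (5.20) is rejected; in the second case the instruments have no partial correlation with the regressor, and the statistic behaves like a draw from an F distribution with M and N - K - M degrees of freedom.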

The model with a single endogenous variable is said to be overidentified when M > 
1 and there are M — 1 overidentifying restrictions. This terminology comes from the 
fact that, if each $z_h$ has some partial correlation with $x_K$, then we have M − 1 more 
exogenous variables than needed to identify the parameters in equation (5.1). For 
example, if M = 2, we could discard one of the instruments and still achieve identi- 
fication. In Chapter 6 we will show how to test the validity of any overidentifying 
restrictions. 


5.2 General Treatment of Two-Stage Least Squares 


5.2.1 Consistency 


We now summarize asymptotic results for 2SLS in a single-equation model with 
perhaps several endogenous variables among the explanatory variables. Write the 
population model as in equation (5.7), where x is 1 x K and generally includes unity. 
Several elements of x may be correlated with u. As usual, we assume that a random 
sample is available from the population. 


ASSUMPTION 2SLS.1: For some 1 x L vector z, E(z'u) = 0. 


Here we do not specify where the elements of z come from, but any exogenous ele- 
ments of x, including a constant, are included in z. Unless every element of x is ex- 
ogenous, z will have to contain variables obtained from outside the model. The zero 
conditional mean assumption, E(u|z) = 0, implies Assumption 2SLS.1. 

The next assumption contains the general rank condition for single-equation 
analysis. 


ASSUMPTION 2SLS.2: (a) rank E(z’z) = L; (b) rank E(z’x) = K. 


Technically, part a of this assumption is needed, but it is not especially important, 
since the exogenous variables, unless chosen unwisely, will be linearly independent in 
the population (as well as in a typical sample). Part b is the crucial rank condition for 
identification. In a precise sense it means that z is sufficiently linearly related to x so 
that E(z’x) has full column rank. We discussed this concept in Section 5.1 for 
the situation in which x contains a single endogenous variable. When x is exogenous, 
so that z= x, Assumption 2SLS.1 reduces to Assumption OLS.1 and Assumption 
2SLS.2 reduces to Assumption OLS.2. 

Necessary for the rank condition is the order condition, $L \geq K$. In other words, we 
must have at least as many instruments as we have explanatory variables. If we do 
not have as many instruments as right-hand-side variables, then $\beta$ is not identified. 
However, $L \geq K$ is no guarantee that 2SLS.2b holds: the elements of z might not be 
appropriately correlated with the elements of x. 

We already know how to test Assumption 2SLS.2b with a single endogenous ex- 
planatory variable. In the general case, it is possible to test Assumption 2SLS.2b, 
given a random sample on (x,z), essentially by performing tests on the sample 
analogue of E(z’x), Z'X/N. The tests are somewhat complicated; see, for example, Cragg 
and Donald (1996). Often we estimate the reduced form for each endogenous ex- 
planatory variable to make sure that at least one element of z not in x is significant. 
This is not sufficient for the rank condition in general, but it can help us determine if 
the rank condition fails. 

Using linear projections, there is a simple way to see how Assumptions 2SLS.1 and 
2SLS.2 identify $\beta$. First, assuming that E(z’z) is nonsingular, we can always write 
the linear projection of x onto z as $x^* = z\Pi$, where $\Pi$ is the $L \times K$ matrix 
$\Pi = [E(z'z)]^{-1}E(z'x)$. Since each column of $\Pi$ can be consistently estimated by regressing 
the appropriate element of x onto z, for the purposes of identification of $\beta$, we can 
treat $\Pi$ as known. Write $x = x^* + r$, where $E(z'r) = 0$ and so $E(x^{*\prime}r) = 0$. Now, the 
2SLS estimator is effectively the IV estimator using instruments $x^*$. Multiplying 
equation (5.7) by $x^{*\prime}$, taking expectations, and rearranging gives 


$E(x^{*\prime}x)\beta = E(x^{*\prime}y)$, (5.21) 

since $E(x^{*\prime}u) = 0$. Thus, $\beta$ is identified by $\beta = [E(x^{*\prime}x)]^{-1}E(x^{*\prime}y)$ provided $E(x^{*\prime}x)$ is 
nonsingular. But 

$E(x^{*\prime}x) = \Pi'E(z'x) = E(x'z)[E(z'z)]^{-1}E(z'x)$ 

and this matrix is nonsingular if and only if E(z’x) has rank K; that is, if and only if 
Assumption 2SLS.2b holds. If 2SLS.2b fails, then $E(x^{*\prime}x)$ is singular and $\beta$ is not 
identified. (Note that, because $x = x^* + r$ with $E(x^{*\prime}r) = 0$, $E(x^{*\prime}x) = E(x^{*\prime}x^*)$. So $\beta$ 
is identified if and only if rank $E(x^{*\prime}x^*) = K$.) 

The 2SLS estimator can be written as in equation (5.17) or as 


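$\hat{\beta} = \left[ \left( \sum_{i=1}^N x_i' z_i \right) \left( \sum_{i=1}^N z_i' z_i \right)^{-1} \left( \sum_{i=1}^N z_i' x_i \right) \right]^{-1} \left( \sum_{i=1}^N x_i' z_i \right) \left( \sum_{i=1}^N z_i' z_i \right)^{-1} \left( \sum_{i=1}^N z_i' y_i \right)$. (5.22) 
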
We have the following consistency result. 


THEOREM 5.1 (Consistency of 2SLS): Under Assumptions 2SLS.1 and 2SLS.2, the 
2SLS estimator obtained from a random sample is consistent for $\beta$. 


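Proof: Write 


$\hat{\beta} = \beta + \left[ \left( N^{-1} \sum_{i=1}^N x_i' z_i \right) \left( N^{-1} \sum_{i=1}^N z_i' z_i \right)^{-1} \left( N^{-1} \sum_{i=1}^N z_i' x_i \right) \right]^{-1} \left( N^{-1} \sum_{i=1}^N x_i' z_i \right) \left( N^{-1} \sum_{i=1}^N z_i' z_i \right)^{-1} \left( N^{-1} \sum_{i=1}^N z_i' u_i \right)$ 


and, using Assumptions 2SLS.1 and 2SLS.2, apply the law of large numbers to each 
term along with Slutsky’s theorem. 


5.2.2 Asymptotic Normality of Two-Stage Least Squares 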


The asymptotic normality of $\sqrt{N}(\hat{\beta} - \beta)$ follows from the asymptotic normality of 
$N^{-1/2} \sum_{i=1}^N z_i' u_i$, which follows from the central limit theorem under Assumption 
2SLS.1 and mild finite second-moment assumptions. The asymptotic variance is 
simplest under a homoskedasticity assumption: 


ASSUMPTION 2SLS.3: $E(u^2 z'z) = \sigma^2 E(z'z)$, where $\sigma^2 = E(u^2)$. 


This assumption is the same as Assumption OLS.3 except that the vector of instru- 
ments appears in place of x. By the usual LIE argument, sufficient for Assumption 
2SLS.3 is the assumption 


$E(u^2 \mid z) = \sigma^2$, (5.23) 


which is the same as $\mathrm{Var}(u \mid z) = \sigma^2$ if $E(u \mid z) = 0$. (When x contains endogenous 
elements, it makes no sense to make assumptions about Var(u | x).) 


THEOREM 5.2 (Asymptotic Normality of 2SLS): Under Assumptions 2SLS.1-2SLS.3, 
$\sqrt{N}(\hat{\beta} - \beta)$ is asymptotically normally distributed with mean zero and variance matrix 


$\sigma^2 \{ E(x'z)[E(z'z)]^{-1}E(z'x) \}^{-1} = \sigma^2 [E(x^{*\prime}x^*)]^{-1}$, (5.24) 


where $x^* = z\Pi$ is the $1 \times K$ vector of linear projections. The right-hand side of 
equation (5.24) is convenient because it has the same form as the expression for the 
asymptotic variance of the OLS estimator (see equation (4.9)) but with $x^*$ replacing x. 
In particular, we can easily obtain a simple expression for the asymptotic variance for a 
single coefficient: $\mathrm{Avar}\,\sqrt{N}(\hat{\beta}_K - \beta_K) = \sigma^2/\mathrm{Var}(r_K^*)$, where $r_K^*$ is the population 
residual from regressing $x_K^*$ on $x_1^*, \ldots, x_{K-1}^*$ (where, typically, $x_1^* = 1$). 


The proof of Theorem 5.2 is similar to Theorem 4.2 for OLS and is therefore 
omitted. 

The matrix in expression (5.24) is easily estimated using sample averages. To 
estimate $\sigma^2$ we will need appropriate estimates of the $u_i$. Define the 2SLS residuals as 


$\hat{u}_i = y_i - x_i\hat{\beta}, \quad i = 1, 2, \ldots, N$. (5.25) 


Note carefully that these residuals are not the residuals from the second-stage OLS 
regression that can be used to obtain the 2SLS estimates. The residuals from the 
second-stage regression are $y_i - \hat{x}_i\hat{\beta}$. Any 2SLS software routine will compute 
equation (5.25) as the 2SLS residuals, and these are what we need to estimate $\sigma^2$. 

Given the 2SLS residuals, a consistent (though not unbiased) estimator of $\sigma^2$ under 
Assumptions 2SLS.1-2SLS.3 is 

$\hat{\sigma}^2 \equiv (N - K)^{-1} \sum_{i=1}^N \hat{u}_i^2$. (5.26) 


Many regression packages use the degrees-of-freedom adjustment N − K in place of 
N, but this usage does not affect the consistency of the estimator. 
The $K \times K$ matrix 


$\hat{\sigma}^2 \left( \sum_{i=1}^N \hat{x}_i' \hat{x}_i \right)^{-1} = \hat{\sigma}^2 (\hat{X}'\hat{X})^{-1}$ (5.27) 


is a valid estimator of the asymptotic variance of $\hat{\beta}$ under Assumptions 2SLS.1-2SLS.3. 
The (asymptotic) standard error of $\hat{\beta}_j$ is just the square root of the jth 
diagonal element of matrix (5.27). Asymptotic confidence intervals and t statistics are 
obtained in the usual fashion. 
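
The distinction between the 2SLS residuals (5.25) and the second-stage residuals can be seen in a short computation. This is a minimal sketch on simulated data; the function name `tsls` and the data-generating process are assumptions for illustration, not a reference implementation.

```python
import numpy as np

def tsls(y, X, Z):
    """2SLS estimates with the homoskedasticity-based variance matrix (5.27)."""
    N, K = X.shape
    X_hat = Z @ np.linalg.lstsq(Z, X, rcond=None)[0]      # first-stage fitted values
    beta = np.linalg.solve(X_hat.T @ X_hat, X_hat.T @ y)  # same as (5.17)
    u_hat = y - X @ beta                                  # 2SLS residuals (5.25): x_i, not x_hat_i
    sigma2 = u_hat @ u_hat / (N - K)                      # (5.26)
    V = sigma2 * np.linalg.inv(X_hat.T @ X_hat)           # (5.27)
    return beta, np.sqrt(np.diag(V)), u_hat, X_hat

rng = np.random.default_rng(2)
N = 1000
z = rng.normal(size=(N, 2))
u = rng.normal(size=N)
# Hypothetical endogenous regressor with true coefficient 2.0
x_endog = z @ np.array([1.0, 0.5]) + 0.8 * u + rng.normal(size=N)
y = 1.0 + 2.0 * x_endog + u
X = np.column_stack([np.ones(N), x_endog])
Z = np.column_stack([np.ones(N), z])

beta, se, u_hat, X_hat = tsls(y, X, Z)
second_stage_resid = y - X_hat @ beta   # these are NOT the 2SLS residuals
```

Using `second_stage_resid` in place of `u_hat` when estimating the error variance is the mistake that makes the standard errors from an explicit second-stage OLS regression incorrect.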


Example 5.3 (Parents and Husband’s Education as IVs): We use the data on the 
428 working, married women in MROZ.RAW to estimate the wage equation (5.12). 
We assume that experience is exogenous, but we allow educ to be correlated with u. 
The instruments we use for educ are motheduc, fatheduc, and huseduc. The reduced 
form for educ is 


$educ = \delta_0 + \delta_1 exper + \delta_2 exper^2 + \theta_1 motheduc + \theta_2 fatheduc + \theta_3 huseduc + r$. 


Assuming that motheduc, fatheduc, and huseduc are exogenous in the log(wage) 
equation (a tenuous assumption), equation (5.12) is identified if at least one of $\theta_1$, $\theta_2$, 
and $\theta_3$ is nonzero. We can test this assumption using an F test (under 
homoskedasticity). The F statistic (with 3 and 422 degrees of freedom) turns out to be 104.29, 
which implies a p-value of zero to four decimal places. Thus, as expected, educ is 
fairly strongly related to motheduc, fatheduc, and huseduc. (Each of the three t 
statistics is also very significant.) 
When equation (5.12) is estimated by 2SLS, we get the following: 


log(wage) = —.187+ .043 exper — .00086 exper? + .080 educ, 
(.285) (.013) (.00040) (.022) 


where standard errors are in parentheses. The 2SLS estimate of the return to educa- 
tion is about 8 percent, and it is statistically significant. For comparison, when 
equation (5.12) is estimated by OLS, the estimated coefficient on educ is about .107 
with a standard error of about .014. Thus, the 2SLS estimate is notably below the 
OLS estimate and has a larger standard error. 


5.2.3 Asymptotic Efficiency of Two-Stage Least Squares 


The appeal of 2SLS comes from its efficiency in a class of IV estimators: 


THEOREM 5.3 (Relative Efficiency of 2SLS): Under Assumptions 2SLS.1—2SLS.3, 
the 2SLS estimator is efficient in the class of all instrumental variables estimators 
using instruments linear in z. 


Proof: Let $\hat{\beta}$ be the 2SLS estimator, and let $\tilde{\beta}$ be any other IV estimator using 
instruments linear in z. Let the instruments for $\tilde{\beta}$ be $\tilde{x} \equiv z\Gamma$, where $\Gamma$ is an $L \times K$ 
nonstochastic matrix. (Note that z is the $1 \times L$ random vector in the population.) 
We assume that the rank condition holds for $\tilde{x}$. For 2SLS, the choice of IVs is 
effectively $x^* = z\Pi$, where $\Pi = [E(z'z)]^{-1}E(z'x) \equiv D^{-1}C$. (In both cases, we can 
replace $\Gamma$ and $\Pi$ with $\sqrt{N}$-consistent estimators without changing the asymptotic 
variances.) Now, under Assumptions 2SLS.1-2SLS.3, we know the asymptotic variance 
of $\sqrt{N}(\hat{\beta} - \beta)$ is $\sigma^2[E(x^{*\prime}x^*)]^{-1}$, where $x^* = z\Pi$. It is straightforward to show that 
$\mathrm{Avar}[\sqrt{N}(\tilde{\beta} - \beta)] = \sigma^2[E(\tilde{x}'x)]^{-1}[E(\tilde{x}'\tilde{x})][E(x'\tilde{x})]^{-1}$. To show that 
$\mathrm{Avar}[\sqrt{N}(\tilde{\beta} - \beta)] - \mathrm{Avar}[\sqrt{N}(\hat{\beta} - \beta)]$ is positive semidefinite (p.s.d.), it suffices 
to show that $E(x^{*\prime}x^*) - E(x'\tilde{x})[E(\tilde{x}'\tilde{x})]^{-1}E(\tilde{x}'x)$ is p.s.d. But $x = x^* + r$, where 
$E(z'r) = 0$, and so $E(\tilde{x}'r) = 0$. It follows that $E(\tilde{x}'x) = E(\tilde{x}'x^*)$, and so 


$E(x^{*\prime}x^*) - E(x'\tilde{x})[E(\tilde{x}'\tilde{x})]^{-1}E(\tilde{x}'x)$ 
$= E(x^{*\prime}x^*) - E(x^{*\prime}\tilde{x})[E(\tilde{x}'\tilde{x})]^{-1}E(\tilde{x}'x^*) = E(s^{*\prime}s^*)$, 


where $s^* = x^* - L(x^* \mid \tilde{x})$ is the population residual from the linear projection of $x^*$ 
on $\tilde{x}$. Because $E(s^{*\prime}s^*)$ is p.s.d., the proof is complete. 


Theorem 5.3 is vacuous when L = K because any (nonsingular) choice of $\Gamma$ leads 
to the same estimator: the IV estimator derived in Section 5.1.1. 

When x is exogenous, Theorem 5.3 implies that, under Assumptions 2SLS.1— 
2SLS.3, the OLS estimator is efficient in the class of all estimators using instruments 
linear in all exogenous variables z. Why? Because x is a subset of z and so L(x | z) = x. 

Another important implication of Theorem 5.3 is that, asymptotically, we always 
do better by using as many instruments as are available, at least under homo- 
skedasticity. This conclusion follows because using a subset of z as instruments cor- 
responds to using a particular linear combination of z. For certain subsets we might 
achieve the same efficiency as 2SLS using all of z, but we can do no better. This ob- 
servation makes it tempting to add many instruments so that L is much larger than 
K. Unfortunately, 2SLS estimators based on many overidentifying restrictions can 
cause finite sample problems; see Section 5.2.6. 

Because Assumption 2SLS.3 is assumed for Theorem 5.3, it is not surprising that 
more efficient estimators are available if Assumption 2SLS.3 fails. If L > K, a more 
efficient estimator than 2SLS exists, as shown by Hansen (1982) and White (1982b, 
1984). In fact, even if x is exogenous and Assumption OLS.3 holds, OLS is not gen- 
erally asymptotically efficient if, for $x \subset z$, Assumptions 2SLS.1 and 2SLS.2 hold but 
Assumption 2SLS.3 does not. Obtaining the efficient estimator falls under the rubric 
of generalized method of moments estimation, something we cover in Chapter 8. 
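
The implication of Theorem 5.3 that dropping instruments cannot reduce the asymptotic variance under homoskedasticity can be checked on sample moments. The sketch below uses simulated data and a hypothetical helper `avar_factor`; it compares the matrix in (5.24), without the factor $\sigma^2$, for a full instrument list and for a subset.

```python
import numpy as np

def avar_factor(X, Z):
    """Sample analogue of {E(x'z)[E(z'z)]^{-1}E(z'x)}^{-1}, i.e. (5.24) without sigma^2."""
    N = X.shape[0]
    ZX = Z.T @ X / N
    ZZ = Z.T @ Z / N
    return np.linalg.inv(ZX.T @ np.linalg.solve(ZZ, ZX))

rng = np.random.default_rng(3)
N = 2000
z = rng.normal(size=(N, 3))
u = rng.normal(size=N)
# Hypothetical endogenous regressor driven by all three instruments
x_endog = z @ np.array([1.0, 0.5, 0.5]) + 0.8 * u + rng.normal(size=N)
X = np.column_stack([np.ones(N), x_endog])
Z_all = np.column_stack([np.ones(N), z])   # L = 4
Z_sub = Z_all[:, :2]                       # keep one instrument only, so L = K = 2

# Variance with the subset minus variance with all instruments
diff = avar_factor(X, Z_sub) - avar_factor(X, Z_all)
eigenvalues = np.linalg.eigvalsh((diff + diff.T) / 2)
```

All eigenvalues of `diff` are nonnegative up to rounding, so the difference is positive semidefinite. This is an asymptotic comparison only and says nothing about the finite-sample problems that many overidentifying restrictions can cause.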


5.2.4 Hypothesis Testing with Two-Stage Least Squares 


We have already seen that testing hypotheses about a single $\beta_j$ is straightforward 
using an asymptotic t statistic, which has an asymptotic normal distribution under the 
null; some prefer to use the t distribution when N is small. Generally, one should be 
aware that the normal and t approximations can be poor if N is small. Hypotheses 
about single linear combinations involving the $\beta_j$ are also easily carried out using a t 
statistic. The easiest procedure is to define the linear combination of interest, say 
$\theta \equiv a_1\beta_1 + a_2\beta_2 + \cdots + a_K\beta_K$, and then to write one of the $\beta_j$ in terms of $\theta$ and the 
other elements of $\beta$. Then, substitute into the equation of interest so that $\theta$ appears 
directly, and estimate the resulting equation by 2SLS to get the standard error of $\hat{\theta}$. 
See Problem 5.9 for an example. 

To test multiple linear restrictions of the form $H_0: R\beta = r$, the Wald statistic is just 
as in equation (4.13), but with $\hat{V}$ given by equation (5.27). The Wald statistic, as 
usual, has a limiting null $\chi^2_Q$ distribution, where Q is the number of restrictions. Some 
econometrics packages, such as Stata, 
compute the Wald statistic (actually, its F statistic counterpart, obtained by dividing 
the Wald statistic by Q) after 2SLS estimation using a simple test command. 

A valid test of multiple restrictions can be computed using a residual-based 
method, analogous to the usual F statistic from OLS analysis. Any kind of linear re- 
striction can be recast as exclusion restrictions, and so we explicitly cover exclusion 
restrictions. Write the model as 


$y = x_1\beta_1 + x_2\beta_2 + u$, (5.28) 
where $x_1$ is $1 \times K_1$ and $x_2$ is $1 \times K_2$, and interest lies in testing the $K_2$ restrictions 
$H_0: \beta_2 = 0$ against $H_1: \beta_2 \neq 0$. (5.29) 


Both $x_1$ and $x_2$ can contain endogenous and exogenous variables. 

Let z denote the $1 \times L$ vector of instruments, with $L \geq K_1 + K_2$, and assume that the rank 
condition for identification holds. Justification for the following statistic can be found 
in Wooldridge (1995b). 

Let $\hat{u}_i$ be the 2SLS residuals from estimating the unrestricted model using $z_i$ as 
instruments. Using these residuals, define the 2SLS unrestricted sum of squared 
residuals by 


$SSR_{ur} \equiv \sum_{i=1}^N \hat{u}_i^2$. (5.30) 


In order to define the F statistic for 2SLS, we need the sum of squared residuals from 
the second-stage regressions. Thus, let $\hat{x}_{i1}$ be the $1 \times K_1$ fitted values from the 
first-stage regression $x_{i1}$ on $z_i$. Similarly, $\hat{x}_{i2}$ are the fitted values from the first-stage 
regression $x_{i2}$ on $z_i$. Define $\widehat{SSR}_{ur}$ as the usual sum of squared residuals from the 
unrestricted second-stage regression y on $\hat{x}_1$, $\hat{x}_2$. Similarly, $\widehat{SSR}_r$ is the sum of squared 
residuals from the restricted second-stage regression, y on $\hat{x}_1$. It can be shown that, 
under $H_0: \beta_2 = 0$ (and Assumptions 2SLS.1-2SLS.3), 
$N \cdot (\widehat{SSR}_r - \widehat{SSR}_{ur})/SSR_{ur} \overset{a}{\sim} \chi^2_{K_2}$. It is just as legitimate to use an F-type statistic: 


$F \equiv \frac{(\widehat{SSR}_r - \widehat{SSR}_{ur})}{SSR_{ur}} \cdot \frac{(N - K)}{K_2}$ (5.31) 


is distributed approximately as $F_{K_2, N-K}$. 

Note carefully that $\widehat{SSR}_r$ and $\widehat{SSR}_{ur}$ appear in the numerator of (5.31). These 
quantities typically need to be computed directly from the second-stage regression. In 
the denominator of F is $SSR_{ur}$, which is the 2SLS sum of squared residuals. This is 
what is reported by the 2SLS commands available in popular regression packages. 

For 2SLS, it is important not to use a form of the statistic that would work for 
OLS, namely, 


$\frac{(SSR_r - SSR_{ur})}{SSR_{ur}} \cdot \frac{(N - K)}{K_2}$, (5.32) 


where $SSR_r$ is the 2SLS restricted sum of squared residuals. Not only does expression 
(5.32) not have a known limiting distribution, but it can also be negative with positive 
probability even as the sample size tends to infinity; clearly, such a statistic cannot 
have an approximate F distribution, or any other distribution typically associated 
with multiple hypothesis testing. 
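
The construction of (5.31) can be sketched as follows. This is a minimal illustration on simulated data: the function names `ssr_ols` and `tsls_F`, the single excluded-under-the-null regressor, and all coefficient values are assumptions made for the example.

```python
import numpy as np

def ssr_ols(y, W):
    """Sum of squared residuals from the OLS regression of y on W."""
    e = y - W @ np.linalg.lstsq(W, y, rcond=None)[0]
    return e @ e

def tsls_F(y, X1, X2, Z):
    """F-type statistic (5.31) for H0: beta_2 = 0 in y = x1*beta_1 + x2*beta_2 + u."""
    N = len(y)
    X = np.column_stack([X1, X2])
    K, K2 = X.shape[1], X2.shape[1]
    fit = lambda A: Z @ np.linalg.lstsq(Z, A, rcond=None)[0]
    X1_hat, X2_hat = fit(X1), fit(X2)                 # first-stage fitted values
    X_hat = np.column_stack([X1_hat, X2_hat])
    beta = np.linalg.solve(X_hat.T @ X, X_hat.T @ y)  # unrestricted 2SLS
    ssr_ur = np.sum((y - X @ beta) ** 2)              # 2SLS residuals: denominator
    ssr_hat_ur = ssr_ols(y, X_hat)                    # second stage, unrestricted
    ssr_hat_r = ssr_ols(y, X1_hat)                    # second stage, restricted
    return (ssr_hat_r - ssr_hat_ur) / ssr_ur * (N - K) / K2

rng = np.random.default_rng(4)
N = 400
z = rng.normal(size=(N, 2))
x2 = rng.normal(size=N)                               # exogenous; tested for exclusion
u = rng.normal(size=N)
x_endog = z @ np.array([1.0, 0.5]) + 0.8 * u + rng.normal(size=N)
X1 = np.column_stack([np.ones(N), x_endog])
X2 = x2[:, None]
Z = np.column_stack([np.ones(N), z, x2])

y_null = 1.0 + 2.0 * x_endog + u                      # H0 true
y_alt = y_null + 1.0 * x2                             # H0 false
F_null = tsls_F(y_null, X1, X2, Z)
F_alt = tsls_F(y_alt, X1, X2, Z)
```

The numerator uses second-stage sums of squared residuals while the denominator uses the 2SLS residuals; mixing these up in the other direction gives expression (5.32), which is not a valid statistic.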


Example 5.4 (Parents’ and Husband’s Education as IVs, continued): We add the 
number of young children (kidslt6) and older children (kidsge6) to equation (5.12) 
and test for their joint significance using the Mroz (1987) data. The statistic in 
equation (5.31) is F = .31; with 2 and 422 degrees of freedom, the asymptotic p-value is 
about .737. There is no evidence that number of children affects the wage for working 
women. 


Rather than equation (5.31), we can compute an LM-type statistic for testing 
hypothesis (5.29). Let $\tilde{u}_i$ be the 2SLS residuals from the restricted model. That is, obtain 
$\tilde{\beta}_1$ from the model $y = x_1\beta_1 + u$ using instruments z, and let $\tilde{u}_i \equiv y_i - x_{i1}\tilde{\beta}_1$. Letting 
$\hat{x}_{i1}$ and $\hat{x}_{i2}$ be defined as before, the LM statistic is obtained as $NR_u^2$ from the 
regression 


$\tilde{u}_i$ on $\hat{x}_{i1}, \hat{x}_{i2}, \quad i = 1, 2, \ldots, N$, (5.33) 


where $R_u^2$ is generally the uncentered R-squared. (That is, the total sum of squares in 
the denominator of R-squared is not demeaned.) When $\{\tilde{u}_i\}$ has a zero sample 
average, the uncentered R-squared and the usual R-squared are the same. This is the case 
when the null explanatory variables $x_1$ and the instruments z both contain unity, the 
typical case. Under $H_0$ and Assumptions 2SLS.1-2SLS.3, $LM \overset{a}{\sim} \chi^2_{K_2}$. Whether one 
uses this statistic or the F statistic in equation (5.31) is primarily a matter of taste; 
asymptotically, there is nothing that distinguishes the two. 


5.2.5 Heteroskedasticity-Robust Inference for Two-Stage Least Squares 


Assumption 2SLS.3 can be restrictive, so we should have a variance matrix estimator 
that is robust in the presence of heteroskedasticity of unknown form. As usual, we 
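need to estimate B along with A. Under Assumptions 2SLS.1 and 2SLS.2 only, 
$\mathrm{Avar}(\hat{\beta})$ can be estimated as 


$(\hat{X}'\hat{X})^{-1} \left( \sum_{i=1}^N \hat{u}_i^2 \hat{x}_i' \hat{x}_i \right) (\hat{X}'\hat{X})^{-1}$. (5.34) 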


Sometimes this matrix is multiplied by N/(N — K) as a degrees-of-freedom adjust- 
ment. This heteroskedasticity-robust estimator can be used anywhere the estimator 
$\hat{\sigma}^2(\hat{X}'\hat{X})^{-1}$ is. In particular, the square roots of the diagonal elements of the matrix 
(5.34) are the heteroskedasticity-robust standard errors for 2SLS. These can be used 
to construct (asymptotic) t statistics in the usual way. Some packages compute these 
standard errors using a simple command. For example, using Stata, rounded to three 
decimal places the heteroskedasticity-robust standard error for educ in Example 5.3 is 
.022, which is the same as the usual standard error rounded to three decimal places. 
The robust standard error for exper is .015, somewhat higher than the nonrobust one 
(.013). 

Sometimes it is useful to compute a robust standard error that can be computed 
with any regression package. Wooldridge (1995b) shows how this procedure can be 
carried out using an auxiliary linear regression for each parameter. Consider 
computing the robust standard error for $\hat{\beta}_j$. Let “se($\hat{\beta}_j$)” denote the standard error 
computed using the usual variance matrix (5.27); we put this in quotes because it is no 
longer appropriate if Assumption 2SLS.3 fails. The $\hat{\sigma}$ is obtained from equation 
(5.26), and $\hat{u}_i$ are the 2SLS residuals from equation (5.25). Let $\hat{r}_{ij}$ be the residuals 
from the regression 


$\hat{x}_{ij}$ on $\hat{x}_{i1}, \hat{x}_{i2}, \ldots, \hat{x}_{i,j-1}, \hat{x}_{i,j+1}, \ldots, \hat{x}_{iK}, \quad i = 1, 2, \ldots, N$ 


and define $\hat{m}_j \equiv \sum_{i=1}^N \hat{r}_{ij}^2 \hat{u}_i^2$. Then, a heteroskedasticity-robust standard error of $\hat{\beta}_j$ can 
be calculated as 


$se(\hat{\beta}_j) = [N/(N - K)]^{1/2} [\text{“se}(\hat{\beta}_j)\text{”}/\hat{\sigma}]^2 (\hat{m}_j)^{1/2}$. (5.35) 


Many econometrics packages compute equation (5.35) for you, but it is also easy to 
compute directly. 
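
As a check on the auxiliary-regression route, the following sketch computes a robust standard error two ways: from the sandwich matrix (5.34) with the N/(N − K) adjustment, and from the usual standard error, $\hat{\sigma}$, and $\hat{m}_j = \sum_i \hat{r}_{ij}^2 \hat{u}_i^2$. The simulated heteroskedastic design and every variable name are assumptions made for the illustration.

```python
import numpy as np

rng = np.random.default_rng(5)
N = 800
z = rng.normal(size=(N, 2))
x1 = rng.normal(size=N)
u = rng.normal(size=N) * np.sqrt(0.5 + z[:, 0] ** 2)    # heteroskedastic error
x_endog = z @ np.array([1.0, 0.5]) + 0.8 * u + rng.normal(size=N)
y = 1.0 + 0.5 * x1 + 2.0 * x_endog + u
X = np.column_stack([np.ones(N), x1, x_endog])
Z = np.column_stack([np.ones(N), x1, z])
K = X.shape[1]

X_hat = Z @ np.linalg.lstsq(Z, X, rcond=None)[0]
beta = np.linalg.solve(X_hat.T @ X_hat, X_hat.T @ y)
u_hat = y - X @ beta                                    # 2SLS residuals (5.25)

# Sandwich estimator (5.34), with the N/(N-K) adjustment applied
bread = np.linalg.inv(X_hat.T @ X_hat)
meat = (X_hat * (u_hat ** 2)[:, None]).T @ X_hat
se_robust = np.sqrt(N / (N - K) * np.diag(bread @ meat @ bread))

# Auxiliary-regression route for one coefficient (index j)
j = 2
sigma_hat = np.sqrt(u_hat @ u_hat / (N - K))            # from (5.26)
se_usual = sigma_hat * np.sqrt(bread[j, j])             # from (5.27)
others = np.delete(X_hat, j, axis=1)
r_j = X_hat[:, j] - others @ np.linalg.lstsq(others, X_hat[:, j], rcond=None)[0]
m_j = np.sum(r_j ** 2 * u_hat ** 2)
se_aux = np.sqrt(N / (N - K)) * (se_usual / sigma_hat) ** 2 * np.sqrt(m_j)
```

The two numbers agree because the ratio of the usual standard error to $\hat{\sigma}$, squared, equals the reciprocal of the sum of squared auxiliary residuals.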

To test multiple linear restrictions using the Wald approach, we can use the usual 
statistic but with the matrix (5.34) as the estimated variance. For example, the 
heteroskedasticity-robust version of the test in Example 5.4 gives F = .25; 
asymptotically, F can be treated as an $F_{2,422}$ variate. The asymptotic p-value is .781. 

The Lagrange multiplier test for omitted variables is easily made 
heteroskedasticity-robust. Again, consider the model (5.28) with the null (5.29), but this time 
without the homoskedasticity assumptions. Using the notation from before, with $\tilde{u}_i$ the 
restricted 2SLS residuals, let $\hat{r}_i \equiv (\hat{r}_{i1}, \hat{r}_{i2}, \ldots, \hat{r}_{iK_2})$ be the $1 \times K_2$ vectors of 
residuals from the multivariate regression $\hat{x}_{i2}$ on $\hat{x}_{i1}$, $i = 1, 2, \ldots, N$. (Again, this 
procedure can be carried out by regressing each element of $\hat{x}_{i2}$ on all of $\hat{x}_{i1}$.) Then, 
for each observation, form the $1 \times K_2$ vector 
$\tilde{u}_i \cdot \hat{r}_i \equiv (\tilde{u}_i \cdot \hat{r}_{i1}, \ldots, \tilde{u}_i \cdot \hat{r}_{iK_2})$. Then, the robust LM test is $N - SSR_0$ from the 
regression 1 on $\tilde{u}_i \cdot \hat{r}_{i1}, \ldots, \tilde{u}_i \cdot \hat{r}_{iK_2}$, $i = 1, 2, \ldots, N$. Under $H_0$, 
$N - SSR_0 \overset{a}{\sim} \chi^2_{K_2}$. This procedure can be justified in a manner similar to the tests in 
the context of OLS. You are referred to Wooldridge (1995b) for details. 
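
The steps of the robust LM test translate directly into a few regressions. The sketch below is an illustration on simulated data; the function name `robust_LM` and the design (one tested exogenous regressor, heteroskedastic errors) are assumptions for the example.

```python
import numpy as np

def robust_LM(y, X1, X2, Z):
    """Heteroskedasticity-robust LM statistic for H0: beta_2 = 0 in (5.28)."""
    N = len(y)
    fit = lambda A, B: A @ np.linalg.lstsq(A, B, rcond=None)[0]
    X1_hat, X2_hat = fit(Z, X1), fit(Z, X2)            # first-stage fitted values
    beta1 = np.linalg.solve(X1_hat.T @ X1, X1_hat.T @ y)
    u_til = y - X1 @ beta1                             # restricted 2SLS residuals
    R = X2_hat - fit(X1_hat, X2_hat)                   # residuals r_hat_i, N x K2
    G = u_til[:, None] * R                             # rows are u_til_i * r_hat_i
    ones = np.ones(N)
    resid = ones - fit(G, ones)                        # regress 1 on G, no intercept
    return N - resid @ resid                           # N - SSR_0

rng = np.random.default_rng(7)
N = 400
z = rng.normal(size=(N, 2))
x2 = rng.normal(size=N)                                # exogenous; tested for exclusion
u = rng.normal(size=N) * np.sqrt(0.5 + z[:, 0] ** 2)   # heteroskedastic error
x_endog = z @ np.array([1.0, 0.5]) + 0.8 * u + rng.normal(size=N)
X1 = np.column_stack([np.ones(N), x_endog])
X2 = x2[:, None]
Z = np.column_stack([np.ones(N), z, x2])

y_null = 1.0 + 2.0 * x_endog + u                       # H0 true
y_alt = y_null + 1.0 * x2                              # H0 false
lm_null = robust_LM(y_null, X1, X2, Z)
lm_alt = robust_LM(y_alt, X1, X2, Z)
```

Because the final regression has no intercept and the dependent variable is a vector of ones, the statistic always lies between 0 and N.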


5.2.6 Potential Pitfalls with Two-Stage Least Squares 


When properly applied, the method of instrumental variables can be a powerful tool 
for estimating structural equations using nonexperimental data. Nevertheless, there 
are some problems that one can encounter when applying IV in practice. 

One thing to remember is that, unlike OLS under a zero conditional mean as- 
sumption, IV methods are never unbiased when at least one explanatory variable is 
endogenous in the model. In fact, under standard distributional assumptions, the 
expected value of the 2SLS estimator does not even exist. As shown by Kinal (1980), 
in the case when all endogenous variables have homoskedastic normal distributions 
with expectations linear in the exogenous variables, the number of moments of the 
2SLS estimator that exist is one fewer than the number of overidentifying restrictions. 
This finding implies that when the number of instruments equals the number of ex- 
planatory variables, the IV estimator does not have an expected value. This is one 
reason we rely on large-sample analysis to justify 2SLS. 

Even in large samples, IV methods can be ill-behaved if we have weak instruments. 
Consider the simple model $y = \beta_0 + \beta_1 x_1 + u$, where we use $z_1$ as an instrument for 
$x_1$. Assuming that $\mathrm{Cov}(z_1, x_1) \neq 0$, the plim of the IV estimator is easily shown to be 


$\mathrm{plim}\,\hat{\beta}_1 = \beta_1 + \mathrm{Cov}(z_1, u)/\mathrm{Cov}(z_1, x_1)$, 


which can be written as 


$\mathrm{plim}\,\hat{\beta}_1 = \beta_1 + (\sigma_u/\sigma_{x_1})[\mathrm{Corr}(z_1, u)/\mathrm{Corr}(z_1, x_1)]$, (5.36) 


where $\sigma_u$ and $\sigma_{x_1}$ are the population standard deviations of u and $x_1$, and Corr(·,·) 
denotes correlation. From this equation we see that if $z_1$ and u are 
correlated, the inconsistency in the IV estimator gets arbitrarily large as $\mathrm{Corr}(z_1, x_1)$ 
gets close to zero. Thus, seemingly small correlations between $z_1$ and u can cause 
severe inconsistency, and therefore severe finite sample bias, if $z_1$ is only weakly 
correlated with $x_1$. In fact, it may be better to just use OLS, even if we look only at 
the inconsistencies in the estimators. To see why, let $\tilde{\beta}_1$ denote the OLS estimator and 
write its plim as 


$\mathrm{plim}(\tilde{\beta}_1) = \beta_1 + (\sigma_u/\sigma_{x_1})\,\mathrm{Corr}(x_1, u)$. (5.37) 


Comparing equations (5.37) and (5.36), the signs of the inconsistency of OLS and 
IV can be different. Further, the magnitude of the inconsistency in OLS is smaller 
than that of the IV estimator if $|\mathrm{Corr}(x_1, u)| \cdot |\mathrm{Corr}(z_1, x_1)| < |\mathrm{Corr}(z_1, u)|$. This 
simple inequality makes it apparent that a weak instrument, captured here by 
small $|\mathrm{Corr}(z_1, x_1)|$, can easily make IV have more asymptotic bias than OLS. 
Unfortunately, we never observe u, so we cannot know how large $|\mathrm{Corr}(z_1, u)|$ is 
relative to $|\mathrm{Corr}(x_1, u)|$. But what is certain is that a small correlation between $z_1$ and 
$x_1$, which we can estimate, should raise concerns, even if we think $z_1$ is “almost” 
exogenous. 
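
To make the comparison of (5.36) and (5.37) concrete, the small sketch below evaluates both inconsistency terms. The correlation values are hypothetical numbers chosen for illustration, since in practice the correlations involving u are not observable.

```python
import math

def iv_inconsistency(sigma_u, sigma_x, corr_zu, corr_zx):
    """plim(beta_hat_1) - beta_1 for the IV estimator, from (5.36)."""
    return (sigma_u / sigma_x) * corr_zu / corr_zx

def ols_inconsistency(sigma_u, sigma_x, corr_xu):
    """plim(beta_tilde_1) - beta_1 for the OLS estimator, from (5.37)."""
    return (sigma_u / sigma_x) * corr_xu

# Hypothetical population values: a slightly invalid, weak instrument
sigma_u, sigma_x = 1.0, 1.0
corr_zu, corr_zx, corr_xu = 0.05, 0.10, 0.20

iv_bias = iv_inconsistency(sigma_u, sigma_x, corr_zu, corr_zx)
ols_bias = ols_inconsistency(sigma_u, sigma_x, corr_xu)

# The inequality from the text: OLS is less inconsistent when this holds
iv_worse = abs(corr_xu) * abs(corr_zx) < abs(corr_zu)
```

With these numbers the instrument is far less correlated with u than the regressor is, yet the IV inconsistency is 0.5 against 0.2 for OLS, because the instrument is only weakly correlated with $x_1$.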

Another potential problem with applying 2SLS and other IV procedures is that the 
2SLS standard errors have a tendency to be “large.” What is typically meant by this 
statement is either that 2SLS coefficients are statistically insignificant or that the 
2SLS standard errors are much larger than the OLS standard errors. Not surprisingly, 
the magnitudes of the 2SLS standard errors depend, among other things, on the 
quality of the instrument(s) used in estimation. 
For the following discussion we maintain the standard 2SLS Assumptions 2SLS.1
through 2SLS.3 in the model

$$y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_K x_K + u. \tag{5.38}$$

Let $\hat{\beta}$ be the vector of 2SLS estimators using instruments $z$. For concreteness, we focus
on the asymptotic variance of $\hat{\beta}_K$. Technically, we should study $\mathrm{Avar}\,\sqrt{N}(\hat{\beta}_K - \beta_K)$,
but it is easier to work with an expression that contains the same information. In
particular, we use the fact that

$$\mathrm{Avar}(\hat{\beta}_K) \approx \frac{\sigma^2}{\widehat{\mathrm{SSR}}_K}, \tag{5.39}$$

where $\widehat{\mathrm{SSR}}_K$ is the sum of squared residuals from the regression

$$\hat{x}_K \text{ on } 1, \hat{x}_1, \ldots, \hat{x}_{K-1}. \tag{5.40}$$

(Remember, if $x_j$ is exogenous for any $j$, then $\hat{x}_j = x_j$.) If we replace $\sigma^2$ in expression
(5.39) with $\hat{\sigma}^2$, then expression (5.39) is the usual 2SLS variance estimator. For the
current discussion we are interested in the behavior of $\widehat{\mathrm{SSR}}_K$.

From the definition of an R-squared, we can write

$$\widehat{\mathrm{SSR}}_K = \widehat{\mathrm{SST}}_K (1 - \hat{R}_K^2), \tag{5.41}$$

where $\widehat{\mathrm{SST}}_K$ is the total sum of squares of $\hat{x}_K$ in the sample, $\widehat{\mathrm{SST}}_K =
\sum_{i=1}^{N} (\hat{x}_{iK} - \bar{\hat{x}}_K)^2$, and $\hat{R}_K^2$ is the R-squared from regression (5.40). In the context
of OLS, the term $(1 - R_K^2)$ in equation (5.41) is viewed as a measure of
multicollinearity, whereas $\mathrm{SST}_K$ measures the total variation in $x_K$. We see that, in
addition to traditional multicollinearity, 2SLS can have an additional source of large
variance: the total variation in $\hat{x}_K$ can be small.

When is $\widehat{\mathrm{SST}}_K$ small? Remember, $\hat{x}_K$ denotes the fitted values from the regression

$$x_K \text{ on } z. \tag{5.42}$$

Therefore, $\widehat{\mathrm{SST}}_K$ is the same as the explained sum of squares from the regression
(5.42). If $x_K$ is only weakly related to the IVs, then the explained sum of squares from
regression (5.42) can be quite small, causing a large asymptotic variance for $\hat{\beta}_K$. If
$x_K$ is highly correlated with $z$, then $\widehat{\mathrm{SST}}_K$ can be almost as large as the total sum of
squares of $x_K$, $\mathrm{SST}_K$, and this fact reduces the 2SLS variance estimate.

When $x_K$ is exogenous (whether or not the other elements of $x$ are), $\widehat{\mathrm{SST}}_K =
\mathrm{SST}_K$. While this total variation can be small, it is determined only by the sample
variation in $\{x_{iK} : i = 1, 2, \ldots, N\}$. Therefore, for exogenous elements appearing
among $x$, the quality of instruments has no bearing on the size of the total sum of
squares term in equation (5.41). This fact helps explain why the 2SLS estimates
on exogenous explanatory variables are often much more precise than the
coefficients on endogenous explanatory variables.
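
The decomposition in (5.39) to (5.41) can be checked numerically. The following is a minimal sketch on simulated data; the data-generating process, the coefficient values, and all variable names are assumptions made only for illustration. It computes the first-stage fitted values, the two sums of squares, and confirms that the last diagonal element of the inverse of the fitted-regressor cross-product matrix equals the reciprocal of the residual sum of squares from regression (5.40).

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500
x1 = rng.normal(size=n)                         # exogenous regressor
z1 = rng.normal(size=n)                         # outside instrument
xK = 0.5 * x1 + 0.3 * z1 + rng.normal(size=n)   # endogenous regressor


def fitted(X, target):
    """OLS fitted values from regressing target on the columns of X."""
    coef, *_ = np.linalg.lstsq(X, target, rcond=None)
    return X @ coef


ones = np.ones(n)
Z = np.column_stack([ones, x1, z1])      # all exogenous variables
X_exog = np.column_stack([ones, x1])     # included exogenous regressors

xK_hat = fitted(Z, xK)                                   # regression (5.42)
sst_hat = np.sum((xK_hat - xK_hat.mean()) ** 2)          # SST-hat_K
aux_fit = fitted(X_exog, xK_hat)                         # regression (5.40)
ssr_hat = np.sum((xK_hat - aux_fit) ** 2)                # SSR-hat_K
r2_hat = np.sum((aux_fit - xK_hat.mean()) ** 2) / sst_hat  # R-hat^2_K

# Last diagonal element of (X_hat'X_hat)^{-1}, which multiplies sigma^2
# in the usual 2SLS variance estimator for the coefficient on x_K.
X_hat = np.column_stack([X_exog, xK_hat])
inv_KK = np.linalg.inv(X_hat.T @ X_hat)[-1, -1]

print(ssr_hat, sst_hat * (1 - r2_hat), 1 / inv_KK)
```

Lowering the assumed first-stage coefficient on `z1` shrinks `sst_hat` and pushes `r2_hat` toward one, and both effects reduce `ssr_hat`.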

In addition to making the term $\widehat{\mathrm{SST}}_K$ small, poor quality of instruments can lead to
$\hat{R}_K^2$ close to one. As an illustration, consider a model in which $x_K$ is the only
endogenous variable and there is one instrument $z_1$ in addition to the exogenous variables
$(1, x_1, \ldots, x_{K-1})$. Therefore, $z = (1, x_1, \ldots, x_{K-1}, z_1)$. (The same argument works for
multiple instruments.) The fitted values $\hat{x}_K$ come from the regression

$$x_K \text{ on } 1, x_1, \ldots, x_{K-1}, z_1. \tag{5.43}$$

Because all other regressors are exogenous (that is, they are included in $z$), $\hat{R}_K^2$ comes
from the regression

$$\hat{x}_K \text{ on } 1, x_1, \ldots, x_{K-1}. \tag{5.44}$$

Now, from basic least squares mechanics, if the coefficient on $z_1$ in regression (5.43) is
exactly zero, then the R-squared from regression (5.44) is exactly unity, in which case
the 2SLS estimator does not even exist. This outcome virtually never happens, but
$z_1$ could have little explanatory value for $x_K$ once $x_1, \ldots, x_{K-1}$ have been controlled
for, in which case $\hat{R}_K^2$ can be close to one. Identification, which only has to do with
whether we can consistently estimate $\beta$, requires only that $z_1$ appear with nonzero
coefficient in the population analogue of regression (5.43). But if the explanatory
power of $z_1$ is weak, the asymptotic variance of the 2SLS estimator can be quite
large. This is another way to illustrate why nonzero correlation between $x_K$ and $z_1$ is
not enough for 2SLS to be effective: the partial correlation is what matters for the
asymptotic variance.

Shea (1997) uses equation (5.39) and the analogous formula for OLS to define a
measure of “instrument relevance” for $x_K$ (when it is treated as endogenous). The
measure is simply $\widehat{\mathrm{SSR}}_K/\mathrm{SSR}_K$, where $\mathrm{SSR}_K$ is the sum of squared residuals from
the regression $x_K$ on $x_1, \ldots, x_{K-1}$. In other words, the measure is essentially the ratio
of the asymptotic variance estimator of OLS (when it is consistent) to the asymptotic
variance estimator of the 2SLS estimator. Shea (1997) notes that the measure also
can be computed as the squared correlation between two sets of residuals, the first
obtained from regressing $\hat{x}_K$ on $x_1, \ldots, x_{K-1}$ and the second obtained from regressing
$x_K$ on $x_1, \ldots, x_{K-1}$. Further, the probability limit of $\widehat{\mathrm{SSR}}_K/\mathrm{SSR}_K$ appears in the
inconsistency of the 2SLS estimator (relative to the OLS estimator), but with some
other terms that cannot be estimated.
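
As a rough illustration of the two equivalent ways of computing Shea's measure, the sketch below uses simulated data with one included exogenous regressor and one outside instrument. The data-generating process and all names are illustrative assumptions, not part of the original discussion.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 1000
x1 = rng.normal(size=n)                          # included exogenous regressor
z1 = rng.normal(size=n)                          # outside instrument
xK = 0.6 * x1 + 0.2 * z1 + rng.normal(size=n)    # regressor treated as endogenous


def residuals(X, target):
    """OLS residuals from regressing target on the columns of X."""
    coef, *_ = np.linalg.lstsq(X, target, rcond=None)
    return target - X @ coef


ones = np.ones(n)
X_exog = np.column_stack([ones, x1])
Z = np.column_stack([ones, x1, z1])

xK_hat = xK - residuals(Z, xK)           # first-stage fitted values
res_hat = residuals(X_exog, xK_hat)      # x_K-hat partialled on 1, x1
res_obs = residuals(X_exog, xK)          # x_K partialled on 1, x1

# (i) ratio of sums of squared residuals; (ii) squared correlation of residuals.
shea_ratio = np.sum(res_hat ** 2) / np.sum(res_obs ** 2)
shea_corr2 = np.corrcoef(res_hat, res_obs)[0, 1] ** 2

print(shea_ratio, shea_corr2)
```

The two numbers agree up to floating-point error because the included exogenous variables are a subset of the instrument list.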
Shea’s measure is useful for summarizing the strength of the instruments in a single
number. Nevertheless, like measures of multicollinearity, Shea’s measure of instru- 
ment relevance has its shortcomings. In effect, it uses OLS as a benchmark and 
records how much larger the 2SLS standard errors are than the OLS standard errors. 
It says nothing directly about whether the 2SLS estimator is precise enough for 
inference purposes or whether the asymptotic normal approximation to the actual
distribution of the 2SLS estimator is acceptable. And, of course, it is silent on whether the
instruments are actually exogenous. 

We are in a difficult situation when the 2SLS standard errors are so large that 
nothing is significant. Often we must choose between a possibly inconsistent estima- 
tor that has relatively small standard errors (OLS) and a consistent estimator that is 
so imprecise that nothing interesting can be concluded (2SLS). One approach is to 
use OLS unless we can reject exogeneity of the explanatory variables. We show how 
to test for endogeneity of one or more explanatory variables in Section 6.2.1. 

There has been some important recent work on the finite sample properties of 
2SLS that emphasizes the potentially large biases of 2SLS, even when sample sizes 
seem to be quite large. Remember that the 2SLS estimator is never unbiased (pro- 
vided one has at least one truly endogenous variable in x). But we hope that, with a 
very large sample size, we need only weak instruments to get an estimator with small 
bias. Unfortunately, this hope is not fulfilled. For example, Bound, Jaeger, and Baker 
(1995) show that in the setting of Angrist and Krueger (1991), the 2SLS estimator 
can be expected to behave quite poorly, an alarming finding because Angrist and 
Krueger use 300,000 to 500,000 observations! The problem is that the instruments— 
representing quarters of birth and various interactions of these with year of birth and 
state of birth—are very weak, and they are too numerous relative to their contribu- 
tion in explaining years of education. One lesson is that, even with a very large sample 
size and zero correlation between the instruments and error, we should not use too 
many overidentifying restrictions. 

Staiger and Stock (1997) provide a theoretical analysis of the 2SLS estimator (and
related estimators) with weak instruments. Formally, they model the weak instrument
problem as one where the coefficients on the instruments in the reduced form of
the endogenous variables converge to zero as the sample increases at the rate $1/\sqrt{N}$; in
the limit, the model is not identified. (This device is not intended to capture the way
data are actually generated; its usefulness is in the resulting approximations to the
exact distribution of the IV estimators.) For example, in the simple regression model
with a single instrument, the reduced form is assumed to be $x_{i1} = \pi_0 + (\rho_1/\sqrt{N}) z_{i1} +
v_{i1}$, $i = 1, 2, \ldots, N$, where $\rho_1 \neq 0$. Because $\rho_1/\sqrt{N} \to 0$, the resulting asymptotic
analysis is much different from the first-order asymptotic theory we have covered in
this chapter; in fact, the 2SLS estimator is inconsistent and has a nonnormal limiting
distribution not centered at the population parameter value.

One lesson that comes out of the Staiger-Stock work is that we should always 
compute the F statistics from first-stage regressions (or the t statistic with a single
instrumental variable). Staiger and Stock (1997) provide some guidelines about how 
large the first-stage F statistic should be (equivalently, how small the associated 
p-value should be) for 2SLS to have acceptable statistical properties. These guidelines 
should be generally helpful, but they are derived assuming the endogenous explana- 
tory variables have linear conditional expectations and constant variance conditional 
on the exogenous variables. 
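
A first-stage F statistic is simple to compute directly from restricted and unrestricted sums of squared residuals. The sketch below uses simulated data under homoskedasticity; the data-generating process, the helper `ssr`, and the variable names are assumptions made for illustration. With a single excluded instrument, the F statistic equals the square of the usual t statistic on that instrument.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 400
x1 = rng.normal(size=n)                               # included exogenous regressor
z1 = rng.normal(size=n)                               # excluded instrument
x_endog = 0.5 * x1 + 0.15 * z1 + rng.normal(size=n)   # endogenous regressor


def ssr(X, target):
    """Sum of squared OLS residuals from regressing target on X."""
    coef, *_ = np.linalg.lstsq(X, target, rcond=None)
    resid = target - X @ coef
    return float(resid @ resid)


ones = np.ones(n)
Z_full = np.column_stack([ones, x1, z1])     # first stage with the instrument
Z_restricted = np.column_stack([ones, x1])   # first stage without it

num_restrictions = Z_full.shape[1] - Z_restricted.shape[1]
df_resid = n - Z_full.shape[1]
ssr_ur = ssr(Z_full, x_endog)
ssr_r = ssr(Z_restricted, x_endog)
f_stat = ((ssr_r - ssr_ur) / num_restrictions) / (ssr_ur / df_resid)

# Usual (nonrobust) t statistic on z1 in the unrestricted first stage.
coef = np.linalg.solve(Z_full.T @ Z_full, Z_full.T @ x_endog)
sigma2_hat = ssr_ur / df_resid
se_z1 = np.sqrt(sigma2_hat * np.linalg.inv(Z_full.T @ Z_full)[-1, -1])
t_stat = coef[-1] / se_z1

print(f_stat, t_stat ** 2)
```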


5.3 IV Solutions to the Omitted Variables and Measurement Error Problems 


In this section, we briefly survey the different approaches that have been suggested 
for using IV methods to solve the omitted variables problem. Section 5.3.2 covers an 
approach that applies to measurement error as well. 


5.3.1 Leaving the Omitted Factors in the Error Term 
Consider again the omitted variable model

$$y = \beta_0 + \beta_1 x_1 + \cdots + \beta_K x_K + \gamma q + v, \tag{5.45}$$

where $q$ represents the omitted variable and $\mathrm{E}(v \mid x, q) = 0$. The solution that would
follow from Section 5.1.1 is to put $q$ in the error term, and then to find instruments
for any element of $x$ that is correlated with $q$. It is useful to think of the instruments
satisfying the following requirements: (1) they are redundant in the structural model
$\mathrm{E}(y \mid x, q)$; (2) they are uncorrelated with the omitted variable, $q$; and (3) they are
sufficiently correlated with the endogenous elements of $x$ (that is, those elements that
are correlated with $q$). Then 2SLS applied to equation (5.45) with $u = \gamma q + v$
produces consistent and asymptotically normal estimators.


5.3.2 Solutions Using Indicators of the Unobservables 


An alternative solution to the omitted variable problem is similar to the OLS proxy
variable solution but requires IV rather than OLS estimation. In the OLS proxy
variable solution we assume that we have $z_1$ such that $q = \theta_0 + \theta_1 z_1 + r_1$, where $r_1$ is
uncorrelated with $z_1$ (by definition) and is uncorrelated with $x_1, \ldots, x_K$ (the key proxy
variable assumption). Suppose instead that we have two indicators of $q$. Like a proxy
variable, an indicator of $q$ must be redundant in equation (5.45). The key difference is
that an indicator can be written as

$$q_1 = \delta_0 + \delta_1 q + a_1, \tag{5.46}$$

where

$$\mathrm{Cov}(q, a_1) = 0, \quad \mathrm{Cov}(x, a_1) = 0. \tag{5.47}$$

This assumption contains the classical errors-in-variables model as a special case,
where $q$ is the unobservable, $q_1$ is the observed measurement, $\delta_0 = 0$, and $\delta_1 = 1$, in
which case $\gamma$ in equation (5.45) can be identified.

Assumption (5.47) is very different from the proxy variable assumption. Assuming
that $\delta_1 \neq 0$ (otherwise, $q_1$ is not correlated with $q$), we can rearrange equation (5.46)
as

$$q = -(\delta_0/\delta_1) + (1/\delta_1) q_1 - (1/\delta_1) a_1, \tag{5.48}$$

where the error in this equation, $-(1/\delta_1) a_1$, is necessarily correlated with $q_1$; the
OLS-proxy variable solution would be inconsistent.

To use the indicator assumption (5.47), we need some additional information. One
possibility is to have a second indicator of $q$:

$$q_2 = \rho_0 + \rho_1 q + a_2, \tag{5.49}$$

where $a_2$ satisfies the same assumptions as $a_1$ and $\rho_1 \neq 0$. We still need one more
assumption:

$$\mathrm{Cov}(a_1, a_2) = 0. \tag{5.50}$$

This implies that any correlation between $q_1$ and $q_2$ arises through their common
dependence on $q$.
Plugging $q_1$ in for $q$ and rearranging gives

$$y = \alpha_0 + x\beta + \gamma_1 q_1 + (v - \gamma_1 a_1), \tag{5.51}$$

where $\gamma_1 = \gamma/\delta_1$. Now, $q_2$ is uncorrelated with $v$ because it is redundant in equation
(5.45). Further, by assumption, $q_2$ is uncorrelated with $a_1$ ($a_1$ is uncorrelated with $q$
and $a_2$). Since $q_1$ and $q_2$ are correlated, $q_2$ can be used as an IV for $q_1$ in equation
(5.51). Of course, the roles of $q_2$ and $q_1$ can be reversed. This solution to the omitted
variables problem is sometimes called the multiple indicator solution.

It is important to see that the multiple indicator IV solution is very different from
the IV solution that leaves $q$ in the error term. When we leave $q$ as part of the error,
we must decide which elements of $x$ are correlated with $q$, and then find IVs for those
elements of $x$. With multiple indicators for $q$, we need not know which elements of $x$
are correlated with $q$; they all might be. In equation (5.51) the elements of $x$ serve as
their own instruments. Under the assumptions we have made, we only need an
instrument for $q_1$, and $q_2$ serves that purpose.
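
The multiple indicator solution can be illustrated with a small simulation. This is only a sketch under assumed parameter values (a single regressor correlated with q, unit loadings on both indicators, independent normal errors); none of the numbers come from the text. It estimates equation (5.51) by IV, using the second indicator as the instrument for the first, and compares the result with OLS that simply plugs in the first indicator as if it were a proxy.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 20000
beta0, beta1, gamma = 1.0, 0.5, 1.0      # assumed structural parameters

q = rng.normal(size=n)                   # unobserved factor
x = 0.8 * q + rng.normal(size=n)         # regressor correlated with q
v = rng.normal(size=n)
y = beta0 + beta1 * x + gamma * q + v    # equation (5.45) with K = 1

# Two indicators with mutually uncorrelated errors, as in (5.46)-(5.50).
q1 = 0.0 + 1.0 * q + rng.normal(size=n)
q2 = 0.5 + 1.0 * q + rng.normal(size=n)

ones = np.ones(n)
X = np.column_stack([ones, x, q1])       # regressors in (5.51)
Z = np.column_stack([ones, x, q2])       # x is its own instrument; q2 for q1

# Just-identified IV estimator: (Z'X)^{-1} Z'y.
iv_coef = np.linalg.solve(Z.T @ X, Z.T @ y)

# OLS with q1 plugged in, which ignores the correlation between q1 and a1.
ols_proxy_coef = np.linalg.lstsq(X, y, rcond=None)[0]

print("IV:", iv_coef)
print("OLS with q1 as a proxy:", ols_proxy_coef)
```

In this design the IV coefficient on `x` settles near the assumed value 0.5, while the plug-in OLS coefficient stays well above it.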


Example 5.5 (IQ and KWW as Indicators of Ability): We apply the indicator
method to the model of Example 4.3, using the 935 observations in NLS80.RAW. In
addition to IQ, we have a knowledge of the working world (KWW) test score. If we
write $IQ = \delta_0 + \delta_1 abil + a_1$, $KWW = \rho_0 + \rho_1 abil + a_2$, and the previous assumptions
are satisfied in equation (4.29), then we can add IQ to the wage equation and use
KWW as an instrument for IQ. We get

$$\widehat{\log(wage)} = \underset{(0.33)}{4.59} + \underset{(.003)}{.014}\, exper + \underset{(.003)}{.010}\, tenure + \underset{(.041)}{.201}\, married$$
$$\qquad - \underset{(.031)}{.051}\, south + \underset{(.028)}{.177}\, urban - \underset{(.074)}{.023}\, black + \underset{(.017)}{.025}\, educ + \underset{(.005)}{.013}\, IQ.$$

The estimated return to education is about 2.5 percent, and it is not statistically
significant at the 5 percent level even with a one-sided alternative. If we reverse the roles
of KWW and IQ, we get an even smaller return to education: about 1.7 percent, with
a t statistic of about 1.07. The statistical insignificance is perhaps not too surprising
given that we are using IV, but the magnitudes of the estimates are surprisingly small.
Perhaps $a_1$ and $a_2$ are correlated with each other, or with some elements of $x$.


In the case of the CEV measurement error model, $q_1$ and $q_2$ are measures of
$q$ assumed to have uncorrelated measurement errors. Since $\delta_0 = \rho_0 = 0$ and $\delta_1 =
\rho_1 = 1$, $\gamma_1 = \gamma$. Therefore, having two measures, where we plug one into the equation
and use the other as its instrument, provides consistent estimators of all parameters in
the CEV setup.

There are other ways to use indicators of an omitted variable (or a single mea- 
surement in the context of measurement error) in an IV approach. Suppose that only 
one indicator of q is available. Without further information, the parameters in the 
structural model are not identified. However, suppose we have additional variables 
that are redundant in the structural equation (uncorrelated with v), are uncorrelated 
with the error a; in the indicator equation, and are correlated with q. Then, as you 
are asked to show in Problem 5.7, estimating equation (5.51) using this additional 
set of variables as instruments for qı produces consistent estimators. This is almost 
the method proposed by Griliches and Mason (1972). As discussed by Cardell and 
Hopkins (1977), Griliches and Mason incorrectly restrict the reduced form for q1; see 
Problem 5.11. Blackburn and Neumark (1992) correctly implement the IV method. 
Problems


5.1. In this problem you are to establish the algebraic equivalence between 2SLS 
and OLS estimation of an equation containing an additional regressor. Although the 
result is completely general, for simplicity consider a model with a single (suspected) 
endogenous variable: 


$$y_1 = z_1\delta_1 + \alpha_1 y_2 + u_1,$$
$$y_2 = z\pi_2 + v_2.$$

For notational clarity, we use $y_2$ as the suspected endogenous variable and $z$ as the
vector of all exogenous variables. The second equation is the reduced form for $y_2$.
Assume that $z$ has at least one more element than $z_1$.

We know that one estimator of $(\delta_1, \alpha_1)$ is the 2SLS estimator using instruments $z$.
Consider an alternative estimator of $(\delta_1, \alpha_1)$: (a) estimate the reduced form by OLS,
and save the residuals $\hat{v}_2$; (b) estimate the following equation by OLS:

$$y_1 = z_1\delta_1 + \alpha_1 y_2 + \rho_1 \hat{v}_2 + \text{error}. \tag{5.52}$$

Show that the OLS estimates of $\delta_1$ and $\alpha_1$ from this regression are identical to the
2SLS estimators. (Hint: Use the partitioned regression algebra of OLS. In particular,
if $\hat{y} = x_1\hat{\beta}_1 + x_2\hat{\beta}_2$ is an OLS regression, $\hat{\beta}_1$ can be obtained by first regressing $x_1$
on $x_2$, getting the residuals, say $\ddot{x}_1$, and then regressing $y$ on $\ddot{x}_1$; see, for example,
Davidson and MacKinnon (1993, Section 1.4). You must also use the fact that $z_1$ and
$\hat{v}_2$ are orthogonal in the sample.)


5.2. Consider a model for the health of an individual: 
$$health = \beta_0 + \beta_1 age + \beta_2 weight + \beta_3 height + \beta_4 male + \beta_5 work + \beta_6 exercise + u_1, \tag{5.53}$$

where health is some quantitative measure of the person’s health; age, weight, height,
and male are self-explanatory; work is weekly hours worked; and exercise is the hours
of exercise per week.

a. Why might you be concerned about exercise being correlated with the error term
$u_1$?

b. Suppose you can collect data on two additional variables, disthome and distwork,
the distances from home and from work to the nearest health club or gym. Discuss
whether these are likely to be uncorrelated with $u_1$.

c. Now assume that disthome and distwork are in fact uncorrelated with $u_1$, as are all
variables in equation (5.53) with the exception of exercise. Write down the reduced 
form for exercise, and state the conditions under which the parameters of equation 
(5.53) are identified. 


d. How can the identification assumption in part c be tested? 


5.3. Consider the following model to estimate the effects of several variables, in- 
cluding cigarette smoking, on the weight of newborns: 


$$\log(bwght) = \beta_0 + \beta_1 male + \beta_2 parity + \beta_3 \log(faminc) + \beta_4 packs + u, \tag{5.54}$$


where male is a binary indicator equal to one if the child is male, parity is the birth 
order of this child, faminc is family income, and packs is the average number of packs 
of cigarettes smoked per day during pregnancy. 


a. Why might you expect packs to be correlated with u? 


b. Suppose that you have data on average cigarette price in each woman’s state of 
residence. Discuss whether this information is likely to satisfy the properties of a 
good instrumental variable for packs. 


c. Use the data in BWGHT.RAW to estimate equation (5.54). First, use OLS. Then, 
use 2SLS, where cigprice is an instrument for packs. Discuss any important differ- 
ences in the OLS and 2SLS estimates. 


d. Estimate the reduced form for packs. What do you conclude about identification 
of equation (5.54) using cigprice as an instrument for packs? What bearing does this 
conclusion have on your answer from part c? 


5.4. Use the data in CARD.RAW for this problem. 


a. Estimate a log(wage) equation by OLS with educ, exper, $exper^2$, black, south,
smsa, reg661 through reg668, and smsa66 as explanatory variables. Compare your 
results with Table 2, Column (2) in Card (1995). 


b. Estimate a reduced form equation for educ containing all explanatory variables from 
part a and the dummy variable nearc4. Do educ and nearc4 have a practically and sta- 
tistically significant partial correlation? (See also Table 3, Column (1) in Card (1995).) 


c. Estimate the log(wage) equation by IV, using nearc4 as an instrument for educ. 
Compare the 95 percent confidence interval for the return to education with that 
obtained from part a. (See also Table 3, Column (5) in Card (1995).) 

d. Now use nearc2 along with nearc4 as instruments for educ. First estimate the 
reduced form for educ, and comment on whether nearc2 or nearc4 is more strongly 
related to educ. How do the 2SLS estimates compare with the earlier estimates? 
e. For a subset of the men in the sample, IQ score is available. Regress iq on nearc4.
Is IQ score uncorrelated with nearc4?


f. Now regress iq on nearc4 along with smsa66, reg661, reg662, and reg669. Are iq
and nearc4 partially correlated? What do you conclude about the importance of 
controlling for the 1966 location and regional dummies in the log(wage) equation 
when using nearc4 as an IV for educ? 


5.5. One occasionally sees the following reasoning used in applied work for choos- 
ing instrumental variables in the context of omitted variables. The model is 


$$y_1 = z_1\delta_1 + \alpha_1 y_2 + \gamma q + a_1,$$

where $q$ is the omitted factor. We assume that $a_1$ satisfies the structural error
assumption $\mathrm{E}(a_1 \mid z_1, y_2, q) = 0$, that $z_1$ is exogenous in the sense that $\mathrm{E}(q \mid z_1) = 0$, but
that $y_2$ and $q$ may be correlated. Let $z_2$ be a vector of instrumental variable
candidates for $y_2$. Suppose it is known that $z_2$ appears in the linear projection of $y_2$ onto
$(z_1, z_2)$, and so the requirement that $z_2$ be partially correlated with $y_2$ is satisfied.
Also, we are willing to assume that $z_2$ is redundant in the structural equation, so that
$a_1$ is uncorrelated with $z_2$. What we are unsure of is whether $z_2$ is correlated with the
omitted variable $q$, in which case $z_2$ would not contain valid IVs.

To “test” whether $z_2$ is in fact uncorrelated with $q$, it has been suggested to use
OLS on the equation

$$y_1 = z_1\delta_1 + \alpha_1 y_2 + z_2\psi_1 + u_1, \tag{5.55}$$

where $u_1 = \gamma q + a_1$, and test $H_0\colon \psi_1 = 0$. Why does this method not work?


5.6. Refer to the multiple indicator model in Section 5.3.2. 


a. Show that if $q_2$ is uncorrelated with $x_j$, $j = 1, 2, \ldots, K$, then the reduced form of
$q_1$ depends only on $q_2$. (Hint: Use the fact that the reduced form of $q_1$ is the linear
projection of $q_1$ onto $(1, x_1, x_2, \ldots, x_K, q_2)$ and find the coefficient vector on $x$ using
Property LP.7 from Chapter 2.)


b. What happens if $q_2$ and $x$ are correlated? In this setting, is it realistic to assume
that $q_2$ and $x$ are uncorrelated? Explain.


5.7. Consider model (5.45) where $v$ has zero mean and is uncorrelated with
$x_1, \ldots, x_K$ and $q$. The unobservable $q$ is thought to be correlated with at least some of
the $x_j$. Assume without loss of generality that $\mathrm{E}(q) = 0$.

You have a single indicator of $q$, written as $q_1 = \delta_1 q + a_1$, $\delta_1 \neq 0$, where $a_1$ has
zero mean and is uncorrelated with each of $x_j$, $q$, and $v$. In addition, $z_1, z_2, \ldots, z_M$ is a
set of variables that are (1) redundant in the structural equation (5.45) and (2)
uncorrelated with $a_1$.


a. Suggest an IV method for consistently estimating the $\beta_j$. Be sure to discuss what is
needed for identification.


b. If equation (5.45) is a log(wage) equation, $q$ is ability, $q_1$ is IQ or some other test
score, and $z_1, \ldots, z_M$ are family background variables, such as parents’ education and
number of siblings, describe the economic assumptions needed for consistency of
the IV procedure in part a.


c. Carry out this procedure using the data in NLS80.RAW. Include among the
explanatory variables exper, tenure, educ, married, south, urban, and black. First use IQ
as $q_1$ and then KWW. Include in the $z_h$ the variables meduc, feduc, and sibs. Discuss
the results. 


5.8. Consider a model with unobserved heterogeneity (q) and measurement error in 
an explanatory variable: 


$$y = \beta_0 + \beta_1 x_1 + \cdots + \beta_K x_K^* + q + v,$$

where $e_K = x_K - x_K^*$ is the measurement error and we set the coefficient on $q$ equal to
one without loss of generality. The variable $q$ might be correlated with any of the
explanatory variables, but an indicator, $q_1 = \delta_0 + \delta_1 q + a_1$, is available. The
measurement error $e_K$ might be correlated with the observed measure, $x_K$. In addition to
$q_1$, you also have variables $z_1, z_2, \ldots, z_M$, $M \geq 2$, that are uncorrelated with $v$, $a_1$,
and $e_K$.

a. Suggest an IV procedure for consistently estimating the $\beta_j$. Why is $M \geq 2$
required? (Hint: Plug in $q_1$ for $q$ and $x_K$ for $x_K^*$, and go from there.)

b. Apply this method to the model estimated in Example 5.5, where actual
education, say $educ^*$, plays the role of $x_K^*$. Use IQ as the indicator of $q$ = ability, and
KWW, meduc, feduc, and sibs as the elements of $z$.


5.9. Suppose that the following wage equation is for working high school graduates: 
$$\log(wage) = \beta_0 + \beta_1 exper + \beta_2 exper^2 + \beta_3 twoyr + \beta_4 fouryr + u,$$

where twoyr is years of junior college attended and fouryr is years completed at a
four-year college. You have distances from each person’s home at the time of high
school graduation to the nearest two-year and four-year colleges as instruments for
twoyr and fouryr. Show how to rewrite this equation to test $H_0\colon \beta_3 = \beta_4$ against
$H_1\colon \beta_4 > \beta_3$, and explain how to estimate the equation. See Kane and Rouse (1995)
and Rouse (1995), who implement a very similar procedure. 
5.10. Consider IV estimation of the simple linear model with a single, possibly
endogenous, explanatory variable, and a single instrument:

$$y = \beta_0 + \beta_1 x + u,$$
$$\mathrm{E}(u) = 0, \quad \mathrm{Cov}(z, u) = 0, \quad \mathrm{Cov}(z, x) \neq 0, \quad \mathrm{E}(u^2 \mid z) = \sigma^2.$$

a. Under the preceding (standard) assumptions, show that $\mathrm{Avar}\,\sqrt{N}(\hat{\beta}_1 - \beta_1)$ can be
expressed as $\sigma^2/(\rho_{zx}^2 \sigma_x^2)$, where $\sigma_x^2 = \mathrm{Var}(x)$ and $\rho_{zx} = \mathrm{Corr}(z, x)$. Compare this result
with the asymptotic variance of the OLS estimator under Assumptions OLS.1 through OLS.3.


b. Comment on how each factor affects the asymptotic variance of the IV estimator.
What happens as $\rho_{zx} \to 0$?


5.11. A model with a single endogenous explanatory variable can be written as

$$y_1 = z_1\delta_1 + \alpha_1 y_2 + u_1, \quad \mathrm{E}(z'u_1) = 0,$$

where $z = (z_1, z_2)$. Consider the following two-step method, intended to mimic 2SLS:


a. Regress $y_2$ on $z_2$, and obtain fitted values, $\tilde{y}_2$. (That is, $z_1$ is omitted from the
first-stage regression.)


b. Regress $y_1$ on $z_1$, $\tilde{y}_2$ to obtain $\tilde{\delta}_1$ and $\tilde{\alpha}_1$. Show that $\tilde{\delta}_1$ and $\tilde{\alpha}_1$ are generally
inconsistent. When would $\tilde{\delta}_1$ and $\tilde{\alpha}_1$ be consistent? (Hint: Let $y_2^0 = z_2\lambda_2$ be the population
linear projection of $y_2$ on $z_2$, and let $a_2$ be the projection error: $y_2 = z_2\lambda_2 + a_2$,
$\mathrm{E}(z_2'a_2) = 0$. For simplicity, pretend that $\lambda_2$ is known rather than estimated; that is,
assume that $\tilde{y}_2$ is actually $y_2^0$. Then, write

$$y_1 = z_1\delta_1 + \alpha_1 y_2^0 + \alpha_1 a_2 + u_1$$

and check whether the composite error $\alpha_1 a_2 + u_1$ is uncorrelated with the explanatory
variables.)


5.12. In the setup of Section 5.1.2 with $x = (x_1, \ldots, x_K)$ and $z = (x_1, x_2, \ldots, x_{K-1},
z_1, \ldots, z_M)$ (let $x_1 = 1$ to allow an intercept), assume that $\mathrm{E}(z'z)$ is nonsingular.
Prove that rank $\mathrm{E}(z'x) = K$ if and only if at least one $\theta_j$ in equation (5.15) is different
from zero. (Hint: Write $x^* = (x_1, \ldots, x_{K-1}, x_K^*)$ as the linear projection of each
element of $x$ on $z$, where $x_K^* = \delta_1 x_1 + \cdots + \delta_{K-1} x_{K-1} + \theta_1 z_1 + \cdots + \theta_M z_M$. Then $x =
x^* + r$, where $\mathrm{E}(z'r) = 0$, so that $\mathrm{E}(z'x) = \mathrm{E}(z'x^*)$. Now $x^* = z\Pi$, where $\Pi$ is
the $L \times K$ matrix whose first $K - 1$ columns are the first $K - 1$ unit vectors in $\mathbb{R}^L$,
namely $(1, 0, 0, \ldots, 0)'$, $(0, 1, 0, \ldots, 0)'$, ..., $(0, 0, \ldots, 1, 0, \ldots, 0)'$, and whose last column is
$(\delta_1, \delta_2, \ldots, \delta_{K-1}, \theta_1, \ldots, \theta_M)'$. Write $\mathrm{E}(z'x^*) = \mathrm{E}(z'z)\Pi$, so that, because $\mathrm{E}(z'z)$ is
nonsingular, $\mathrm{E}(z'x^*)$ has rank $K$ if and only if $\Pi$ has rank $K$.)
5.13. Consider the simple regression model

$$y = \beta_0 + \beta_1 x + u$$

and let $z$ be a binary instrumental variable for $x$.

a. Show that the IV estimator $\hat{\beta}_1$ can be written as

$$\hat{\beta}_1 = (\bar{y}_1 - \bar{y}_0)/(\bar{x}_1 - \bar{x}_0),$$

where $\bar{y}_0$ and $\bar{x}_0$ are the sample averages of $y_i$ and $x_i$ over the part of the sample with
$z_i = 0$, and $\bar{y}_1$ and $\bar{x}_1$ are the sample averages of $y_i$ and $x_i$ over the part of the sample
with $z_i = 1$. This estimator, known as a grouping estimator, was first suggested by
Wald (1940).


b. What is the interpretation of $\hat{\beta}_1$ if $x$ is also binary, for example, representing
participation in a social program?


5.14. Consider the model in (5.1) and (5.2), where we have additional exogenous
variables $z_1, \ldots, z_M$. Let $z = (1, x_1, \ldots, x_{K-1}, z_1, \ldots, z_M)$ be the vector of all
exogenous variables. This problem essentially asks you to obtain the 2SLS estimator using
linear projections. Assume that $\mathrm{E}(z'z)$ is nonsingular.


a. Find $\mathrm{L}(y \mid z)$ in terms of the $\beta_j$, $x_1, \ldots, x_{K-1}$, and $x_K^* = \mathrm{L}(x_K \mid z)$.

b. Argue that, provided $x_1, \ldots, x_{K-1}, x_K^*$ are not perfectly collinear, an OLS
regression of $y$ on $1, x_1, \ldots, x_{K-1}, x_K^*$ (using a random sample) consistently estimates all $\beta_j$.


c. State a necessary and sufficient condition for $x_K^*$ not to be a perfect linear
combination of $x_1, \ldots, x_{K-1}$. What 2SLS assumption is this identical to?


5.15. Consider the model $y = x\beta + u$, where $x_1, x_2, \ldots, x_{K_1}$, $K_1 \leq K$, are the
(potentially) endogenous explanatory variables. (We assume a zero intercept just to
simplify the notation; the following results carry over to models with an unknown
intercept.) Let $z_1, \ldots, z_{L_1}$ be the instrumental variables available from outside the
model. Let $z = (z_1, \ldots, z_{L_1}, x_{K_1+1}, \ldots, x_K)$ and assume that $\mathrm{E}(z'z)$ is nonsingular, so
that Assumption 2SLS.2a holds.

a. Show that a necessary condition for the rank condition, Assumption 2SLS.2b, is
that for each $j = 1, \ldots, K_1$, at least one $z_h$ must appear in the reduced form of $x_j$.

b. With $K_1 = 2$, give a simple example showing that the condition from part a is not
sufficient for the rank condition.

c. If $L_1 = K_1$, show that a sufficient condition for the rank condition is that only $z_j$
appears in the reduced form for $x_j$, $j = 1, \ldots, K_1$. (As in Problem 5.12, it suffices to
study the rank of the $L \times K$ matrix $\Pi$ in $\mathrm{L}(x \mid z) = z\Pi$.)
5.16. Consider the population model

$$y = \alpha + \beta w + u,$$

where all quantities are scalars. Let $g$ be a $1 \times L$ vector, $L \geq 2$, with $\mathrm{E}(g'u) = 0$;
assume the first element of $g$ is unity. The following is easily extended to the case of a
vector $w$ of endogenous variables.


a. Let $\tilde{\beta}$ denote the 2SLS estimator of $\beta$ using instruments $g$. (Of course, the intercept
$\alpha$ is estimated along with $\beta$.) Argue that, under the rank condition 2SLS.2 and the
homoskedasticity Assumption 2SLS.3 (with $z = g$),

$$\mathrm{Avar}\,\sqrt{N}(\tilde{\beta} - \beta) = \sigma_u^2/\mathrm{Var}(w^*),$$

where $\sigma_u^2 = \mathrm{Var}(u)$ and $w^* = \mathrm{L}(w \mid g)$. (Hint: See the discussion following equation
(5.24).)

b. Let $h$ be a $1 \times J$ vector (with zero mean, to slightly simplify the argument)
uncorrelated with $g$ but generally correlated with $u$. Write the linear projection of $u$
on $h$ as $u = h\gamma + v$, $\mathrm{E}(h'v) = 0$. Explain why $\mathrm{E}(g'v) = 0$.


c. Write the equation

$$y = \alpha + \beta w + h\gamma + v$$

and let $\hat{\beta}$ be the 2SLS estimator of $\beta$ using instruments $z = (g, h)$. Assuming
Assumptions 2SLS.2 and 2SLS.3 for this choice of instruments, show that

$$\mathrm{Avar}\,\sqrt{N}(\hat{\beta} - \beta) = \sigma_v^2/\mathrm{Var}(w^*),$$

where $\sigma_v^2 = \mathrm{Var}(v)$. (Hint: Again use the discussion following equation (5.24), but
now with $x^* = (1, \check{w}, h)$, where $\check{w} = \mathrm{L}(w \mid g, h)$. Now, the asymptotic variance can be
written $\sigma_v^2/\mathrm{Var}(\check{r})$, where $\check{r}$ is the population residual from the regression $\check{w}$ on $1, h$.
Show that $\check{r} = w^*$ when $\mathrm{E}(g'h) = 0$.)


d. Argue that for estimating the coefficient on an endogenous explanatory variable, 
it is always better to include exogenous variables that are orthogonal to the available 
instruments. For more general results, see Qian and Schmidt (1999). Problem 4.5 
covers the OLS case. 


6 Additional Single-Equation Topics 


6.1 Estimation with Generated Regressors and Instruments 


In this section we discuss the large-sample properties of OLS and 2SLS estimators 
when some regressors or instruments have been estimated in a first step. 


6.1.1 Ordinary Least Squares with Generated Regressors 


We often need to draw on results for OLS estimation when one or more of the 
regressors have been estimated from a first-stage procedure. To illustrate the issues, 
consider the model 


Yy = Po + Bix +++ + Bex + yq +u. (6.1) 


We observe $x_1, \ldots, x_K$, but $q$ is unobserved. However, suppose that $q$ is related to
observable data through the function $q = f(w, \delta)$, where $f$ is a known function and
$w$ is a vector of observed variables, but the vector of parameters $\delta$ is unknown (which
is why $q$ is not observed). Often, but not always, $q$ will be a linear function of $w$ and
$\delta$. Suppose that we can consistently estimate $\delta$, and let $\hat{\delta}$ be the estimator. For each
observation $i$, $\hat{q}_i = f(w_i, \hat{\delta})$ effectively estimates $q_i$. Pagan (1984) calls $\hat{q}_i$ a generated
regressor. It seems reasonable that replacing $q_i$ with $\hat{q}_i$ in running the OLS regression


$y_i$ on $1, x_{i1}, x_{i2}, \ldots, x_{iK}, \hat{q}_i$, $i = 1, \ldots, N$, (6.2)


should produce consistent estimates of all parameters, including y. The question is, 
What assumptions are sufficient? 

While we do not cover the asymptotic theory needed for a careful proof until 
Chapter 12 (which treats nonlinear estimation), we can provide some intuition here. 
Because plim $\hat{\delta} = \delta$, by the law of large numbers it is reasonable that

$N^{-1} \sum_{i=1}^{N} \hat{q}_i u_i \overset{p}{\to} E(q_i u_i), \qquad N^{-1} \sum_{i=1}^{N} x_{ij} \hat{q}_i \overset{p}{\to} E(x_{ij} q_i).$

From these results it is easily shown that the usual OLS assumption in the population,
that $u$ is uncorrelated with $(x_1, x_2, \ldots, x_K, q)$, suffices for the two-step procedure to
be consistent (along with the rank condition of Assumption OLS.2 applied to the
expanded vector of explanatory variables). In other words, for consistency, replacing
$q_i$ with $\hat{q}_i$ in an OLS regression causes no problems.
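
The consistency claim can be illustrated with a small simulation. This is only a minimal sketch under assumptions that are not in the text: $f$ is linear, so $q = w\delta$; $\delta$ is estimated by OLS from a hypothetical auxiliary variable `s = q + noise`; and all parameter values, the seed, and the helper name `ols` are invented for illustration.

```python
import numpy as np

def ols(X, y):
    """OLS coefficients from regressing y on the columns of X."""
    return np.linalg.lstsq(X, y, rcond=None)[0]

rng = np.random.default_rng(42)
N = 20_000
beta0, beta1, gamma = 1.0, -0.5, 0.8          # hypothetical true values
delta = np.array([0.5, 1.0, -1.0])            # hypothetical first-stage parameters

w = np.column_stack([np.ones(N), rng.normal(size=(N, 2))])
q = w @ delta                                 # unobserved regressor, q = f(w, delta)
s = q + rng.normal(size=N)                    # assumed auxiliary outcome used to estimate delta
x1 = 0.3 * w[:, 1] + rng.normal(size=N)
u = rng.normal(size=N)                        # uncorrelated with (x1, q)
y = beta0 + beta1 * x1 + gamma * q + u

# Step 1: estimate delta, then form the generated regressor q_hat_i = f(w_i, delta_hat).
delta_hat = ols(w, s)
q_hat = w @ delta_hat

# Step 2: OLS of y on (1, x1, q_hat), as in regression (6.2).
b = ols(np.column_stack([np.ones(N), x1, q_hat]), y)
print(b)   # should be close to (beta0, beta1, gamma) in a sample this large
```

The sketch speaks only to consistency; it reports no standard errors, because the usual ones from the second step are generally invalid when $\gamma \neq 0$.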

Things are not so simple when it comes to inference: the standard errors and test
statistics obtained from regression (6.2) are generally invalid because they ignore the
sampling variation in $\hat{\delta}$. Because $\hat{\delta}$ is also obtained using data (usually the same
sample of data), uncertainty in the estimate should be accounted for in the second
step. Nevertheless, there is at least one important case where the sampling variation
of $\hat{\delta}$ can be ignored, at least asymptotically: if

$E[\nabla_{\delta} f(w, \delta)'u] = 0,$ (6.3)
$\gamma = 0,$ (6.4)

then the $\sqrt{N}$-limiting distribution of the OLS estimators from regression (6.2) is the
same as that of the OLS estimators when $q$ replaces $\hat{q}$. Condition (6.3) is implied by the zero
conditional mean condition,


E(u|x,w) = 0, (6.5) 


which frequently holds in generated regressor contexts. 

We often want to test the null hypothesis $H_0\colon \gamma = 0$ before including $\hat{q}$ in the final
regression. Fortunately, the usual $t$ statistic on $\hat{q}$ has a limiting standard normal
distribution under $H_0$, so it can be used to test $H_0$. It simply requires the usual
homoskedasticity assumption, $E(u^2 \mid x, q) = \sigma^2$. The heteroskedasticity-robust statistic
works if heteroskedasticity is present in $u$ under $H_0$.

Even if condition (6.3) holds, if $\gamma \neq 0$, then an adjustment, due to estimation of $\delta$, is
needed for the asymptotic variances of all OLS estimators. Thus,
standard $t$ statistics, $F$ statistics, and LM statistics will not be asymptotically valid
when $\gamma \neq 0$. Using the methods of Chapter 3, it is not difficult to derive an adjustment
to the usual variance matrix estimate that accounts for the variability in
$\hat{\delta}$ (and also allows for heteroskedasticity). It is not true that replacing $q_i$ with $\hat{q}_i$
simply introduces heteroskedasticity into the error term; this is not the correct way
to think about the generated regressors issue. Accounting for the fact that $\hat{\delta}$ depends
on the same random sample used in the second-stage estimation is much different
from having heteroskedasticity in the error. Of course, we might want to use a
heteroskedasticity-robust standard error for testing $H_0\colon \gamma = 0$ because heteroskedasticity
in the population error $u$ can always be a problem. However, just as with
the usual OLS standard error, this is generally justified only under $H_0\colon \gamma = 0$.

A general formula for the asymptotic variance of 2SLS in the presence of gen- 
erated regressors is given in the appendix to this chapter; this covers OLS with gen- 
erated regressors as a special case. A general framework for handling these problems 
is given in Newey (1984) and Newey and McFadden (1994), but we must hold off 
until Chapter 14 to give a careful treatment. 


6.1.2 Two-Stage Least Squares with Generated Instruments 


In later chapters we will need results on 2SLS estimation when the instruments have 
been estimated in a preliminary stage. Write the population model as

$y = x\beta + u,$ (6.6)
E(z'u) = 0, (6.7) 


where $x$ is a $1 \times K$ vector of explanatory variables and $z$ is a $1 \times L$ ($L \geq K$) vector of
instrumental variables. Assume that $z = g(w, \lambda)$, where $g(\cdot, \lambda)$ is a known function but
$\lambda$ needs to be estimated. For each $i$, define the generated instruments $\hat{z}_i = g(w_i, \hat{\lambda})$.
What can we say about the 2SLS estimator when the $\hat{z}_i$ are used as instruments?

By the same reasoning for OLS with generated regressors, consistency follows 
under weak conditions. Further, under conditions that are met in many applications, 
we can ignore the fact that the instruments were estimated when using 2SLS for
inference. Sufficient are the assumptions that $\hat{\lambda}$ is $\sqrt{N}$-consistent for $\lambda$ and that

$E[\nabla_{\lambda} g(w, \lambda)'u] = 0.$ (6.8)

Under condition (6.8), which holds when $E(u \mid w) = 0$, the $\sqrt{N}$-asymptotic distribution
of $\hat{\beta}$ is the same whether we use $\lambda$ or $\hat{\lambda}$ in constructing the instruments. This fact
greatly simplifies calculation of asymptotic standard errors and test statistics. Therefore,
if we have a choice, there are practical reasons for using 2SLS with generated
instruments rather than OLS with generated regressors. We will see some examples in
Section 6.4 and Part IV.

One consequence of this discussion is that, if we add the 2SLS homoskedasticity
assumption (2SLS.3), the usual 2SLS standard errors and test statistics are asymptotically
valid. If Assumption 2SLS.3 is violated, we simply use the heteroskedasticity-robust
standard errors and test statistics. Of course, the finite sample properties of the
estimator using $\hat{z}_i$ as instruments could be notably different from those using $z_i$ as
instruments, especially for small sample sizes. Determining whether this is the case
requires either more sophisticated asymptotic approximations or simulations on a 
case-by-case basis. 


6.1.3 Generated Instruments and Regressors 


We will encounter examples later where some instruments and some regressors are 
estimated in a first stage. Generally, the asymptotic variance needs to be adjusted 
because of the generated regressors, although there are some special cases where the 
usual variance matrix estimators are valid. As a general example, consider the model 


$y = x\beta + \gamma f(w, \delta) + u, \qquad E(u \mid z, w) = 0,$

and we estimate $\delta$ in a first stage. If $\gamma = 0$, then the 2SLS estimator of $(\beta', \gamma)'$ in the
equation

$y_i = x_i\beta + \gamma \hat{f}_i + \mathrm{error}_i,$

using instruments $(z_i, \hat{f}_i)$, has a limiting distribution that does not depend on the
limiting distribution of $\sqrt{N}(\hat{\delta} - \delta)$ under conditions (6.3) and (6.8). Therefore, the
usual 2SLS $t$ statistic for $\hat{\gamma}$, or its heteroskedasticity-robust version, can be used to test
$H_0\colon \gamma = 0$.


6.2 Control Function Approach to Endogeneity 


For several reasons, including handling endogeneity in nonlinear models (which
we will encounter often in Part IV), it is very useful to have a different approach
for dealing with endogenous explanatory variables. Generally, the control function
approach uses extra regressors to break the correlation between endogenous
explanatory variables and unobservables affecting the response. As we will see, the
method still relies on the availability of exogenous variables that do not appear in the
structural equation. For notational clarity, $y_1$ denotes the response variable, $y_2$ is the
endogenous explanatory variable (a scalar for simplicity), and $z$ is the $1 \times L$ vector of
exogenous variables (which includes unity as its first element). Consider the model


$y_1 = z_1\delta_1 + \alpha_1 y_2 + u_1,$ (6.9)

where $z_1$ is a $1 \times L_1$ strict subvector of $z$ that also includes a constant. The sense in
which $z$ is exogenous is given by the $L$ orthogonality (zero covariance) conditions

$E(z'u_1) = 0.$ (6.10)

Of course, this is the same exogeneity condition we use for consistency of the 2SLS
estimator, and we can consistently estimate $\delta_1$ and $\alpha_1$ by 2SLS under (6.10) and the
rank condition, Assumption 2SLS.2.

Just as with 2SLS, the reduced form of y2—that is, the linear projection of y2 onto 
the exogenous variables—plays a critical role. Write the reduced form with an error 
term as 


$y_2 = z\pi_2 + v_2,$ (6.11)
$E(z'v_2) = 0,$ (6.12)

where $\pi_2$ is $L \times 1$. From equations (6.11) and (6.12), endogeneity of $y_2$ arises if and
only if $u_1$ is correlated with $v_2$. Write the linear projection of $u_1$ on $v_2$, in error form,
as

$u_1 = \rho_1 v_2 + e_1,$ (6.13)

where $\rho_1 = E(v_2 u_1)/E(v_2^2)$ is the population regression coefficient. By definition,
$E(v_2 e_1) = 0$, and $E(z'e_1) = 0$ because $u_1$ and $v_2$ are both uncorrelated with $z$.
Plugging (6.13) into equation (6.9) gives

$y_1 = z_1\delta_1 + \alpha_1 y_2 + \rho_1 v_2 + e_1,$ (6.14)

where we now view $v_2$ as an explanatory variable in the equation. As just noted, $e_1$ is
uncorrelated with $v_2$ and $z$. Plus, $y_2$ is a linear function of $z$ and $v_2$, and so $e_1$ is also
uncorrelated with $y_2$.

Because $e_1$ is uncorrelated with $z_1$, $y_2$, and $v_2$, equation (6.14) suggests a simple
procedure for consistently estimating $\delta_1$ and $\alpha_1$ (as well as $\rho_1$): run the OLS regression
of $y_1$ on $z_1$, $y_2$, and $v_2$ using a random sample. (Remember, OLS consistently estimates
the parameters in any equation where the error term is uncorrelated with the
right-hand-side variables.) The only problem with this suggestion is that we do not
observe $v_2$; it is the error in the reduced-form equation for $y_2$. Nevertheless, we can
write $v_2 = y_2 - z\pi_2$, and, because we collect data on $y_2$ and $z$, we can consistently
estimate $\pi_2$ by OLS. Therefore, we can replace $v_2$ with $\hat{v}_2$, the OLS residuals from the
first-stage regression of $y_2$ on $z$. Simple substitution gives

$y_1 = z_1\delta_1 + \alpha_1 y_2 + \rho_1 \hat{v}_2 + \mathrm{error},$ (6.15)

where, for each $i$, $\mathrm{error}_i = e_{i1} + \rho_1 z_i(\hat{\pi}_2 - \pi_2)$, which depends on the sampling error
in $\hat{\pi}_2$ unless $\rho_1 = 0$. We can now use the consistency result for generated regressors
from Section 6.1.1 to conclude that the OLS estimators from equation (6.15) will be
consistent for $\delta_1$, $\alpha_1$, and $\rho_1$.

The OLS estimator from equation (6.15) is an example of a control function (CF)
estimator. The inclusion of the residuals $\hat{v}_2$ "controls" for the endogeneity of $y_2$ in the
original equation (although it does so with sampling error because $\hat{\pi}_2 \neq \pi_2$).

How does the CF approach compare with the standard instrumental variables 
(IV) approach, which, for now, means 2SLS? Interestingly, as you are asked to 
show in Problem 5.1 (using least squares algebra), the OLS estimates of $\delta_1$ and $\alpha_1$
from equation (6.15) are identical to the 2SLS estimates. In other words, the two
approaches lead to the same place. Equation (6.15) contains a generated regressor, 
and so obtaining the appropriate standard errors for the CF estimators requires 
applying the correction derived in Appendix 6A. Why, then, have we introduced the 
control function approach when it leads to the same place as 2SLS? 
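
The numerical equivalence between the CF estimates from equation (6.15) and the 2SLS estimates can be checked directly. What follows is a minimal sketch on simulated data; the data-generating values, the seed, and the helper names `ols`, `tsls`, and `control_function` are illustrative assumptions, not part of the text.

```python
import numpy as np

def ols(X, y):
    """OLS coefficients from regressing y (vector or matrix) on the columns of X."""
    return np.linalg.lstsq(X, y, rcond=None)[0]

def tsls(X, Z, y):
    """2SLS: project X on the instruments Z, then regress y on the fitted values."""
    X_hat = Z @ ols(Z, X)
    return ols(X_hat, y)

def control_function(z1, y2, Z, y1):
    """OLS of y1 on (z1, y2, v2_hat), with v2_hat the reduced-form residuals."""
    v2_hat = y2 - Z @ ols(Z, y2)
    return ols(np.column_stack([z1, y2, v2_hat]), y1)

# Simulated data; every number below is a hypothetical choice.
rng = np.random.default_rng(0)
N = 2000
z1 = np.column_stack([np.ones(N), rng.normal(size=N)])   # included exogenous variables
z2 = rng.normal(size=(N, 2))                              # excluded instruments
Z = np.column_stack([z1, z2])
v2 = rng.normal(size=N)
u1 = 0.5 * v2 + rng.normal(size=N)                        # rho_1 = 0.5, so y2 is endogenous
y2 = Z @ np.array([1.0, 0.5, 1.0, -1.0]) + v2             # reduced form (6.11)
y1 = z1 @ np.array([1.0, 2.0]) + 0.7 * y2 + u1            # structural equation (6.9)

b_2sls = tsls(np.column_stack([z1, y2]), Z, y1)
b_cf = control_function(z1, y2, Z, y1)
print(b_2sls)      # estimates of (delta_1, alpha_1)
print(b_cf[:3])    # the same numbers up to rounding; b_cf[3] estimates rho_1
```

Only the point estimates coincide. The OLS standard errors from the CF regression still ignore the first-stage estimation unless $\rho_1 = 0$.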

There are a couple of reasons. First, as we will see in the next section, equation 
(6.15) leads to a straightforward test of the null hypothesis that y2 is exogenous. 
Second, and perhaps more important, the CF approach can be adapted to certain
nonlinear models in cases where analogues of 2SLS have undesirable properties. In
fact, in Part IV we will rely heavily on CF approaches for handling endogeneity in a 
variety of nonlinear models. 

You have perhaps already figured out where the CF approach relies on having at
least one element in $z$ not also in $z_1$, as also required by 2SLS. We can easily see this
from equation (6.15) and the expression $\hat{v}_{i2} = y_{i2} - z_i\hat{\pi}_2$, $i = 1, \ldots, N$. If $z_i = z_{i1}$, then
the $\hat{v}_{i2}$ are exact linear functions of $y_{i2}$ and $z_{i1}$, in which case equation (6.15) suffers
from perfect collinearity. If $z_i$ contains at least one element not in $z_{i1}$, this breaks the
perfect collinearity (at least in the sample). If we write $z = (z_1, z_2)$ and $\pi_{22} = 0$, then,
asymptotically, the CF approach suffers from perfect collinearity, and $\alpha_1$ and $\delta_1$ are
not identified, as we already know from our 2SLS analysis.

Although we just argued that the CF approach is identical to 2SLS in a linear 
model, this equivalence does not carry over to models containing more than one 
function of y2. A simple extension of (6.9) is 


$y_1 = z_1\delta_1 + \alpha_1 y_2 + \gamma_1 y_2^2 + u_1,$ (6.16)
$E(u_1 \mid z) = 0.$ (6.17)

For simplicity, assume that we have a scalar, $z_2$, that is not also in $z_1$. Then, under
assumption (6.17), we can use, say, $z_2^2$ as an instrument for $y_2^2$ because any function
of $z_2$ is uncorrelated with $u_1$. (We must exclude the case where $z_2$ is binary, because
then $z_2^2 = z_2$.) In other words, we can apply the standard IV estimator with explanatory
variables $(z_1, y_2, y_2^2)$ and instruments $(z_1, z_2, z_2^2)$; note that we have two endogenous
explanatory variables, $y_2$ and $y_2^2$.

What would the CF approach entail in this case? To implement the CF approach
in equation (6.16), we obtain the conditional expectation $E(y_1 \mid z, y_2)$ (a linear projection
argument no longer works because of the nonlinearity), and that requires an
assumption about $E(u_1 \mid z, y_2)$. A standard assumption is

$E(u_1 \mid z, y_2) = E(u_1 \mid z, v_2) = E(u_1 \mid v_2) = \rho_1 v_2,$ (6.18)

where the first equality follows because $y_2$ and $v_2$ are one-to-one functions of each
other (given $z$) and the second would hold if $(u_1, v_2)$ is independent of $z$, a nontrivial
restriction on the reduced-form error in equation (6.11), not to mention the structural
error $u_1$. The final assumption is linearity of the conditional expectation $E(u_1 \mid v_2)$,
which is more restrictive than simply defining a linear projection. Under (6.18),

$E(y_1 \mid z, y_2) = z_1\delta_1 + \alpha_1 y_2 + \gamma_1 y_2^2 + E(u_1 \mid z, y_2)$
$\phantom{E(y_1 \mid z, y_2)} = z_1\delta_1 + \alpha_1 y_2 + \gamma_1 y_2^2 + \rho_1 v_2.$ (6.19)

Implementing the CF approach means running the OLS regression $y_1$ on $z_1$, $y_2$, $y_2^2$,
$\hat{v}_2$, where $\hat{v}_2$ still represents the reduced-form residuals. The CF estimates are not the
same as the 2SLS estimates using any choice of instruments for $(y_2, y_2^2)$.
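
To see the two estimators side by side in model (6.16), here is a minimal sketch on simulated data that satisfies (6.17) and (6.18) by construction. The design is an assumption made for illustration only: $z_1$ contains just a constant, $z_2$ is standard normal, $(u_1, v_2)$ is jointly normal and independent of $z$, and the parameter values are hypothetical.

```python
import numpy as np

def ols(X, y):
    """OLS coefficients from regressing y (vector or matrix) on the columns of X."""
    return np.linalg.lstsq(X, y, rcond=None)[0]

rng = np.random.default_rng(1)
N = 5000
const = np.ones(N)
z2 = rng.normal(size=N)                        # excluded, non-binary exogenous variable
v2 = rng.normal(size=N)
u1 = 0.5 * v2 + rng.normal(size=N)             # E(u1 | v2) = 0.5 * v2, independent of z
y2 = 1.0 + z2 + v2                             # linear reduced form with E(v2 | z) = 0
y1 = 1.0 + 0.7 * y2 + 0.3 * y2**2 + u1         # hypothetical (delta_1, alpha_1, gamma_1)

X = np.column_stack([const, y2, y2**2])

# Standard IV (2SLS) with instruments (1, z2, z2^2).
Z = np.column_stack([const, z2, z2**2])
b_iv = ols(Z @ ols(Z, X), y1)

# Control function: add the residual from the reduced form of y2 on (1, z2).
Z_rf = np.column_stack([const, z2])
v2_hat = y2 - Z_rf @ ols(Z_rf, y2)
b_cf = ols(np.column_stack([X, v2_hat]), y1)

print(b_iv)        # 2SLS estimates of (delta_1, alpha_1, gamma_1)
print(b_cf[:3])    # CF estimates; close to, but not equal to, the 2SLS estimates
print(b_cf[3])     # estimate of rho_1
```

Because the simulated data satisfy (6.18), both estimators are consistent here. The sketch says nothing about how the CF estimator behaves when (6.18) fails.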

We will study models such as (6.16) in more detail in Chapter 9—in particular, 
we will discuss identification and choice of instruments—but a few comments are 
in order. First, if the standard IV and CF approaches no longer lead to the same 
estimates, do we prefer one over the other? Not surprisingly, there is a trade-off between
robustness and efficiency. For the 2SLS estimator using instruments $(z_1, z_2, z_2^2)$,
the key assumption is (6.17), along with partial correlation between $y_2$ and $z_2$ in the
RF for $y_2$. (This almost certainly ensures that $y_2^2$ is appropriately partially correlated
with $z_2^2$.) But the CF estimator requires these two assumptions and, in addition, the
last two equalities in (6.18). As shown in Problem 6.13, equations (6.17) and (6.18)
imply that $E(v_2 \mid z) = 0$, which means that $E(y_2 \mid z) = z\pi_2$. A linear conditional
expectation for $y_2$ is a substantive restriction on the conditional distribution of $y_2$.
Therefore, the CF estimator will be inconsistent in cases where the 2SLS estimator
will be consistent. On the other hand, because the CF estimator solves the endogeneity
of $y_2$ and $y_2^2$ by adding the scalar $\hat{v}_2$ to the regression, it will generally be more
precise (perhaps much more precise) than the 2SLS estimator. But a systematic
analysis comparing the two approaches in models such as (6.16) has yet to be done. 
Assumptions such as (6.18) are typically avoided if possible, but this could be costly 
in terms of inference. In Section 9.5 we will further explore models nonlinear in 
endogenous variables. 


6.3 Some Specification Tests 


In Chapters 4 and 5 we covered what is usually called classical hypothesis testing for 
OLS and 2SLS. We now turn to testing some of the assumptions underlying consis- 
tency of either OLS or 2SLS; these are typically called specification tests. These tests, 
including ones robust to arbitrary heteroskedasticity, are easy to obtain and should 
be routinely reported in applications. 


6.3.1 Testing for Endogeneity 


There are various ways to motivate tests for whether some explanatory variables are 
endogenous. If the null hypothesis is that all explanatory variables are exogenous, 
and we allow one or more to be endogenous under the alternative, then we can base a 
test on the difference between the 2SLS and OLS estimators, provided we have sufficient
exogenous instruments to identify the parameters by 2SLS. Tests for endogeneity
have been derived independently by Durbin (1954), Wu (1973), and Hausman
(1978). In the general equation $y = x\beta + u$ with instruments $z$, the Durbin-Wu-Hausman
(DWH) test is based on the difference $\hat{\beta}_{2SLS} - \hat{\beta}_{OLS}$. If all elements of $x$ are
exogenous (and $z$ is also exogenous, a maintained assumption), then 2SLS and OLS
should differ only due to sampling error. Of course, to determine whether this is
so, we need to estimate the asymptotic variance of $\sqrt{N}(\hat{\beta}_{2SLS} - \hat{\beta}_{OLS})$. Generally, the
calculation is cumbersome, but it simplifies considerably if we maintain homoskedasticity
under the null hypothesis. Problem 6.12 asks you to verify that, under the
null hypothesis $E(x'u) = 0$, and the appropriate homoskedasticity assumption,

$\mathrm{Avar}[\sqrt{N}(\hat{\beta}_{2SLS} - \hat{\beta}_{OLS})] = \sigma^2[E(x^{*\prime}x^*)]^{-1} - \sigma^2[E(x'x)]^{-1},$ (6.20)

which is simply the difference between the asymptotic variances. (Hausman (1978)
shows this kind of result holds more generally when one estimator, OLS in this case,
is asymptotically efficient under the null.)

Once we estimate each component in equation (6.20) we can construct a quadratic
form in $\hat{\beta}_{2SLS} - \hat{\beta}_{OLS}$ that has an asymptotic chi-square distribution. The estimators
of the moment matrices are the usual ones, with the first-stage fitted values $\hat{x}_i = z_i\hat{\Pi}$
replacing $x_i^*$, as usual. For $\sigma^2$, there are three possibilities. The first is to estimate the
error variances separately for 2SLS and OLS and insert these in the first and second
occurrences of $\sigma^2$, respectively. Or we can just use the 2SLS estimate or just use the
OLS estimate in both places. Using the latter, we obtain one version of the DWH
statistic:

$(\hat{\beta}_{2SLS} - \hat{\beta}_{OLS})'[(\hat{X}'\hat{X})^{-1} - (X'X)^{-1}]^{-}(\hat{\beta}_{2SLS} - \hat{\beta}_{OLS})/\hat{\sigma}^2_{OLS},$ (6.21)


where we must use a generalized inverse, except in the very unusual case that all
elements of $x$ are allowed to be endogenous under the alternative. In fact, the rank
of $\mathrm{Avar}[\sqrt{N}(\hat{\beta}_{2SLS} - \hat{\beta}_{OLS})]$ is equal to the number of elements of $x$ allowed to be
endogenous under the alternative. The singularity of the matrix in expression (6.21)
makes computing the statistic cumbersome. Nevertheless, some variant of the sta- 
tistic is routinely computed by several popular econometrics packages. See Baum, 
Schaffer, and Stillman (2003) for a detailed survey and recently written software. 
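
As a rough illustration of expression (6.21), the sketch below computes the quadratic form with a Moore-Penrose generalized inverse. It is not the routine used by any particular package; the function names, the simulated design, and the `rcond` cutoff used to discard numerically zero singular values are assumptions made here.

```python
import numpy as np

def dwh_statistic(X, Z, y):
    """Quadratic form (6.21), with sigma^2 estimated from the OLS residuals."""
    N, K = X.shape
    b_ols = np.linalg.lstsq(X, y, rcond=None)[0]
    X_hat = Z @ np.linalg.lstsq(Z, X, rcond=None)[0]      # first-stage fitted values
    b_2sls = np.linalg.lstsq(X_hat, y, rcond=None)[0]
    resid_ols = y - X @ b_ols
    sigma2_ols = resid_ols @ resid_ols / (N - K)
    V_diff = np.linalg.inv(X_hat.T @ X_hat) - np.linalg.inv(X.T @ X)
    d = b_2sls - b_ols
    # V_diff is singular when some regressors are exogenous under the alternative,
    # so a generalized inverse is needed; rcond=1e-8 is an assumed numerical cutoff.
    return float(d @ np.linalg.pinv(V_diff, rcond=1e-8) @ d / sigma2_ols)

def simulate(rho, N=2000, seed=0):
    """Hypothetical design with one regressor, y2, that is endogenous when rho != 0."""
    rng = np.random.default_rng(seed)
    z1 = np.column_stack([np.ones(N), rng.normal(size=N)])
    Z = np.column_stack([z1, rng.normal(size=(N, 2))])
    v2 = rng.normal(size=N)
    u1 = rho * v2 + rng.normal(size=N)
    y2 = Z @ np.array([1.0, 0.5, 1.0, -1.0]) + v2
    y1 = z1 @ np.array([1.0, 2.0]) + 0.7 * y2 + u1
    return np.column_stack([z1, y2]), Z, y1

stat_endog = dwh_statistic(*simulate(rho=0.5))   # y2 endogenous
stat_exog = dwh_statistic(*simulate(rho=0.0))    # y2 exogenous
print(stat_endog, stat_exog)
```

With a single regressor allowed to be endogenous, the statistic would be compared with a chi-square distribution with one degree of freedom.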

A more serious drawback with any statistic based on equation (6.20) is that
it is not robust to heteroskedasticity. A robust variance matrix estimator for
$\mathrm{Avar}[\sqrt{N}(\hat{\beta}_{2SLS} - \hat{\beta}_{OLS})]$ can be obtained (see Problem 6.12 and Baum, Schaffer, and
Stillman (2003)). Unfortunately, calculating robust DWH statistics is done much
less frequently than computing standard errors and classical test statistics that
are robust to heteroskedasticity. Infrequent use of a fully robust DWH statistic
may be partly due to misunderstandings of when the principle of comparing estimators
applies. It is commonly thought that one estimator should be asymptotically
efficient under the null hypothesis, but this is not necessary; indeed, it is essentially
irrelevant. The principle applies whenever two estimators are consistent under the 
null hypothesis and one estimator, 2SLS in this case, retains consistency under the 
alternative. Asymptotic efficiency only simplifies the asymptotic variance estimation. 
In fact, while the standard form of the DWH test maintains OLS.3 and 2SLS.3 in 
order to have the correct asymptotic size, it has no systematic power for detecting 
heteroskedasticity. 

Regression-based endogeneity tests are very convenient because they are easily 
computed and almost trivial to make robust to heteroskedasticity. As pointed out by 
Hausman (1978, 1983), there is a regression-based statistic asymptotically equivalent 
to equation (6.21). Suppose initially that we have a single potentially endogenous 
explanatory variable, as in equation (6.9). We maintain assumption (6.10)—that is, 
each element of z is exogenous. We want to test 


$H_0\colon \mathrm{Cov}(y_2, u_1) = 0$ (6.22)

against the alternative that $y_2$ is correlated with $u_1$. Given the reduced-form equation
(6.11), equation (6.22) is equivalent to $\mathrm{Cov}(v_2, u_1) = E(v_2 u_1) = 0$, or $H_0\colon \rho_1 = 0$,
where $\rho_1$ is the regression coefficient defined in equation (6.13). Even though $u_1$
and $v_2$ are not observed, we can use equation (6.15) to test that $\rho_1 = 0$. In fact, when
we estimate equation (6.15) by OLS, we obtain an OLS coefficient on $\hat{v}_2$, $\hat{\rho}_1$. Conveniently,
using the results from Section 6.1.1, when $\rho_1 = 0$ we do not have to adjust
the standard error of $\hat{\rho}_1$ for the first-stage estimation of $\pi_2$. Therefore, if the homoskedasticity
assumption $E(u_1^2 \mid z, y_2) = E(u_1^2)$ holds under $H_0$, we can use the usual $t$
statistic for $\hat{\rho}_1$ as an asymptotically valid test. Alternatively, we can simply use a
heteroskedasticity-robust $t$ statistic if heteroskedasticity is suspected under $H_0$.

We should remember that the OLS standard errors that would be reported from
equation (6.15) are not valid unless $\rho_1 = 0$, because $\hat{v}_2$ is a generated regressor. In
practice, if we reject $H_0\colon \rho_1 = 0$, then, to get the appropriate standard errors and
other test statistics, we estimate equation (6.9) by 2SLS.
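
The regression-based test can be sketched in a few lines. The code below is an illustration under assumptions, not a reference implementation: the helper names and the simulated design are hypothetical, and the robust variance shown is the basic White form without any degrees-of-freedom adjustment.

```python
import numpy as np

def ols_fit(X, y):
    """Return OLS coefficients, residuals, and (X'X)^{-1}."""
    XtX_inv = np.linalg.inv(X.T @ X)
    b = XtX_inv @ X.T @ y
    return b, y - X @ b, XtX_inv

def endogeneity_t(z1, y2, Z, y1):
    """Usual and heteroskedasticity-robust t statistics on v2_hat in (6.15)."""
    _, v2_hat, _ = ols_fit(Z, y2)                      # first-stage residuals
    W = np.column_stack([z1, y2, v2_hat])
    b, e, WtW_inv = ols_fit(W, y1)
    N, K = W.shape
    V_usual = (e @ e / (N - K)) * WtW_inv
    meat = (W * (e**2)[:, None]).T @ W                 # sum over i of e_i^2 w_i' w_i
    V_robust = WtW_inv @ meat @ WtW_inv                # White estimator, no df adjustment
    j = K - 1                                          # v2_hat is the last column
    return b[j] / np.sqrt(V_usual[j, j]), b[j] / np.sqrt(V_robust[j, j])

# Hypothetical data in which y2 is endogenous (rho_1 = 0.5).
rng = np.random.default_rng(0)
N = 2000
z1 = np.column_stack([np.ones(N), rng.normal(size=N)])
Z = np.column_stack([z1, rng.normal(size=(N, 2))])
v2 = rng.normal(size=N)
u1 = 0.5 * v2 + rng.normal(size=N)
y2 = Z @ np.array([1.0, 0.5, 1.0, -1.0]) + v2
y1 = z1 @ np.array([1.0, 2.0]) + 0.7 * y2 + u1

t_usual, t_robust = endogeneity_t(z1, y2, Z, y1)
print(t_usual, t_robust)   # both far from zero in this design, so H0 is rejected
```

As the text notes, these $t$ statistics are justified only as tests of $H_0\colon \rho_1 = 0$. Once the null is rejected, the other standard errors from this regression should not be reported.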


Example 6.1 (Testing for Endogeneity of Education in a Wage Equation): Consider 
the wage equation 


$\log(wage) = \delta_0 + \delta_1 exper + \delta_2 exper^2 + \alpha_1 educ + u_1$ (6.23)

for working women, where we believe that educ and $u_1$ may be correlated. The
instruments for educ are parents' education and husband's education. So, we first
regress educ on 1, exper, exper$^2$, motheduc, fatheduc, and huseduc and obtain the
residuals, $\hat{v}_2$. Then we simply include $\hat{v}_2$ along with unity, exper, exper$^2$, and educ in
an OLS regression and obtain the $t$ statistic on $\hat{v}_2$. Using the data in MROZ.RAW
gives the result $\hat{\rho}_1 = .047$ and $t_{\hat{\rho}_1} = 1.65$. We find evidence of endogeneity of educ at
the 10 percent significance level against a two-sided alternative, and so 2SLS is
probably a good idea (assuming that we trust the instruments). The correct 2SLS
standard errors are given in Example 5.3.


With only a single suspected endogenous explanatory variable, it is straightforward
to compute a Hausman (1978) statistic that directly compares 2SLS and OLS,
at least when the homoskedasticity assumption holds (for 2SLS and OLS). Then,
$\mathrm{Avar}(\hat{\alpha}_{1,2SLS} - \hat{\alpha}_{1,OLS}) = \mathrm{Avar}(\hat{\alpha}_{1,2SLS}) - \mathrm{Avar}(\hat{\alpha}_{1,OLS})$. Therefore, the Hausman $t$
statistic is simply $(\hat{\alpha}_{1,2SLS} - \hat{\alpha}_{1,OLS})/\{[\mathrm{se}(\hat{\alpha}_{1,2SLS})]^2 - [\mathrm{se}(\hat{\alpha}_{1,OLS})]^2\}^{1/2}$. Under the null
hypothesis, the $t$ statistic has an asymptotic standard normal distribution.
Unfortunately, there is no simple correction if one allows heteroskedasticity: the
asymptotic variance of the difference is no longer the difference in asymptotic variances.
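
The arithmetic of the Hausman $t$ statistic is simple enough to write as a small helper. The function name and the input numbers below are hypothetical and serve only to show the calculation. The guard reflects the fact that, in a finite sample, the estimated variance difference need not be positive.

```python
import math

def hausman_t(b_2sls, se_2sls, b_ols, se_ols):
    """Hausman t statistic for one coefficient, valid under homoskedasticity."""
    var_diff = se_2sls**2 - se_ols**2
    if var_diff <= 0:
        raise ValueError("estimated variance difference is not positive")
    return (b_2sls - b_ols) / math.sqrt(var_diff)

# Hypothetical inputs, chosen only to show the arithmetic.
t_stat = hausman_t(b_2sls=0.10, se_2sls=0.03, b_ols=0.07, se_ols=0.01)
print(round(t_stat, 3))
```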

Extending the regression-based Hausman test to several potentially endogenous 
explanatory variables is straightforward. Let $y_2$ denote a $1 \times G_1$ vector of possible
endogenous variables in the population model

$y_1 = z_1\delta_1 + y_2\alpha_1 + u_1, \qquad E(z'u_1) = 0,$ (6.24)

where $\alpha_1$ is now $G_1 \times 1$. Again, we assume the rank condition for 2SLS. Write the
reduced form as $y_2 = z\Pi_2 + v_2$, where $\Pi_2$ is $L \times G_1$ and $v_2$ is the $1 \times G_1$ vector of
population reduced form errors. For a generic observation, let $\hat{v}_2$ denote the $1 \times G_1$
vector of OLS residuals obtained from each reduced form. (In other words, take each
element of $y_2$ and regress it on $z$ to obtain the RF residuals, then collect these in the
row vector $\hat{v}_2$.) Now, estimate the model

$y_1 = z_1\delta_1 + y_2\alpha_1 + \hat{v}_2\rho_1 + \mathrm{error}$ (6.25)

and do a standard $F$ test of $H_0\colon \rho_1 = 0$, which tests $G_1$ restrictions in the unrestricted
model (6.25). The restricted model is obtained by setting $\rho_1 = 0$, which means we
estimate the original model (6.24) by OLS. The test can be made robust to heteroskedasticity
in $u_1$ (since $u_1 = e_1$ under $H_0$) by applying the heteroskedasticity-robust
Wald statistic in Chapter 4. In some regression packages, such as Stata, the robust
test is implemented as an $F$-type test. More precisely, the robust Wald statistic, which
has an asymptotic chi-square distribution, is divided by $G_1$ to give a statistic that can
be compared with $F$ critical values.

An alternative to the $F$ test is an LM-type test. Let $\hat{u}_1$ be the OLS residuals from
the regression $y_1$ on $z_1$, $y_2$ (the residuals obtained under the null hypothesis that $y_2$ is
exogenous). Then, obtain the usual R-squared (assuming that $z_1$ contains a constant),
say $R_u^2$, from the regression

$\hat{u}_1$ on $z_1, y_2, \hat{v}_2$ (6.26)

and use $N R_u^2$ as asymptotically $\chi^2_{G_1}$. This test again maintains homoskedasticity under
$H_0$. The test can be made heteroskedasticity-robust using the method described in
equation (4.17): take $x_1 = (z_1, y_2)$ and $x_2 = \hat{v}_2$. See also Wooldridge (1995b).
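
The nonrobust $F$ and LM versions of the test can be computed from two sums of squared residuals. The sketch below assumes a simulated design with $G_1 = 2$; the function names and all numerical values are hypothetical, and the heteroskedasticity-robust variants are not shown.

```python
import numpy as np

def ols_resid(X, y):
    """Residuals from the OLS regression of y (vector or matrix) on X."""
    return y - X @ np.linalg.lstsq(X, y, rcond=None)[0]

def endogeneity_F_and_LM(z1, Y2, Z, y1):
    """F statistic for rho_1 = 0 in (6.25) and N*R^2 from regression (6.26)."""
    N, G1 = Y2.shape
    V2_hat = ols_resid(Z, Y2)                        # N x G1 reduced-form residuals
    X_r = np.column_stack([z1, Y2])                  # restricted model (6.24)
    X_ur = np.column_stack([z1, Y2, V2_hat])         # unrestricted model (6.25)
    u_r = ols_resid(X_r, y1)
    u_ur = ols_resid(X_ur, y1)
    ssr_r, ssr_ur = u_r @ u_r, u_ur @ u_ur
    F = ((ssr_r - ssr_ur) / G1) / (ssr_ur / (N - X_ur.shape[1]))
    # Regressing u_r on X_ur leaves residuals u_ur, because u_r is orthogonal to X_r.
    # u_r has mean zero (z1 contains a constant), so its total sum of squares is ssr_r.
    LM = N * (1.0 - ssr_ur / ssr_r)
    return F, LM

# Hypothetical design with two endogenous regressors and three excluded instruments.
rng = np.random.default_rng(0)
N = 2000
z1 = np.column_stack([np.ones(N), rng.normal(size=N)])
Z = np.column_stack([z1, rng.normal(size=(N, 3))])
Pi2 = np.array([[1.0, 0.0], [0.5, 0.5], [1.0, 0.0], [0.0, 1.0], [0.5, -0.5]])
V2 = rng.normal(size=(N, 2))
Y2 = Z @ Pi2 + V2
u1 = V2 @ np.array([0.5, -0.3]) + rng.normal(size=N)   # rho_1 != 0: endogeneity
y1 = z1 @ np.array([1.0, 2.0]) + Y2 @ np.array([0.7, -0.4]) + u1

F, LM = endogeneity_F_and_LM(z1, Y2, Z, y1)
print(F, LM)   # compare F with F(G1, N - K) and LM with chi-square(G1) critical values
```

Computing the LM statistic from the two sums of squared residuals avoids running regression (6.26) explicitly. This shortcut relies on $z_1$ containing a constant.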


Example 6.2 (Endogeneity of Education in a Wage Equation, continued): We add
the interaction term black·educ to the log(wage) equation estimated by Card (1995);
see also Problem 5.4. Write the model as

$\log(wage) = \alpha_1 educ + \alpha_2 black \cdot educ + z_1\delta_1 + u_1,$ (6.27)

where $z_1$ contains a constant, exper, exper$^2$, black, smsa, 1966 regional dummy variables,
and a 1966 SMSA indicator. If educ is correlated with $u_1$, then we also expect
black·educ to be correlated with $u_1$. If nearc4, a binary indicator for whether a worker
grew up near a four-year college, is valid as an instrumental variable for educ, then a
natural instrumental variable for black·educ is black·nearc4. Note that black·nearc4 is
uncorrelated with $u_1$ under the conditional mean assumption $E(u_1 \mid z) = 0$, where $z$
contains all exogenous variables.
The equation estimated by OLS is

$\widehat{\log(wage)} = 4.81 + .071\,educ + .018\,black \cdot educ - .419\,black + \cdots,$

with standard errors, in the same order, of (0.75), (.004), (.006), and (.079).


Therefore, the return to education is estimated to be about 1.8 percentage points 
higher for blacks than for nonblacks, even though wages are substantially lower for 
blacks at all but unrealistically high levels of education. (It takes an estimated 23.3 
years of education before a black worker earns as much as a nonblack worker.) 

To test whether educ is exogenous, we must test whether educ and black·educ are
uncorrelated with $u_1$. We do so by first regressing educ on all instrumental variables:
those elements in $z_1$ plus nearc4 and black·nearc4. (The interaction black·nearc4
should be included because it might be partially correlated with educ.) Let $\hat{v}_{21}$ be the
OLS residuals from this regression. Similarly, regress black·educ on $z_1$, nearc4, and
black·nearc4, and save the residuals $\hat{v}_{22}$. By the way, the fact that the dependent
variable in the second reduced-form regression, black·educ, is zero for a large fraction
of the sample has no bearing on how we test for endogeneity.

Adding $\hat{v}_{21}$ and $\hat{v}_{22}$ to the OLS regression and computing the joint $F$ test yields $F = 0.54$
and p-value = 0.581; thus we do not reject exogeneity of educ and black·educ.

Incidentally, the reduced-form regressions confirm that educ is partially correlated
with nearc4 (but not black·nearc4) and black·educ is partially correlated with
black·nearc4 (but not nearc4). It is easily seen that these findings mean that the rank
condition for 2SLS is satisfied (see Problem 5.15c). Even though educ does not appear
to be endogenous in equation (6.27), we estimate the equation by 2SLS:

$\widehat{\log(wage)} = 3.84 + .127\,educ + .011\,black \cdot educ - .283\,black + \cdots,$

with standard errors, in the same order, of (0.97), (.057), (.040), and (.506).


The 2SLS point estimates certainly differ from the OLS estimates, but the standard 
errors are so large that the 2SLS and OLS estimates are not statistically different. 

Sometimes we may want to test the null hypothesis that a subset of explanatory 
variables is exogenous while allowing another set of variables to be endogenous. As 
described in Davidson and MacKinnon (1993, Section 7.9), it is straightforward to 
obtain the test based on equation (6.25). Write an expanded model as 


$y_1 = z_1\delta_1 + y_2\alpha_1 + y_3\gamma_1 + u_1,$ (6.28)

where $\alpha_1$ is $G_1 \times 1$ and $\gamma_1$ is $J_1 \times 1$. We allow $y_2$ to be endogenous and test
$H_0\colon E(y_3'u_1) = 0$. The relevant equation is now $y_1 = z_1\delta_1 + y_2\alpha_1 + y_3\gamma_1 + v_3\rho_1 + e_1$,
or, when we operationalize it,

$y_1 = z_1\delta_1 + y_2\alpha_1 + y_3\gamma_1 + \hat{v}_3\rho_1 + \mathrm{error},$ (6.29)

where $\rho_1$ now represents the vector of population regression coefficients from $u_1$
on $v_3$. Because $y_2$ is allowed to be endogenous under $H_0$, we cannot estimate equation
(6.29) by OLS in order to test $H_0\colon \rho_1 = 0$. Instead, we apply 2SLS to equation
(6.29) with instruments $(z, y_3, \hat{v}_3)$; remember, $(y_3, v_3)$ are exogenous in the augmented
equation. In effect, we still instrument for $y_2$, but $y_3$ and $\hat{v}_3$ act as their own instruments.
Using the results on IV estimation with generated regressors and instruments
in Section 6.1.3, the usual Wald statistic (possibly implemented as an $F$ statistic)
for testing $H_0\colon \rho_1 = 0$ is asymptotically valid under $H_0$. As usual, it may be prudent
to allow heteroskedasticity of unknown form under $H_0$, and this is easily done using
econometric software that computes heteroskedasticity-robust tests of exclusion
restrictions after 2SLS estimation.


6.3.2 Testing Overidentifying Restrictions 


When we have more instruments than we need to identify an equation, we can test 
whether the additional instruments are valid in the sense that they are uncorrelated 
with $u_1$. To explain the various procedures, write the equation in the form

$y_1 = z_1\delta_1 + y_2\alpha_1 + u_1,$ (6.30)

where $z_1$ is $1 \times L_1$ and $y_2$ is $1 \times G_1$. The $1 \times L$ vector of all exogenous variables is
again $z$; partition this as $z = (z_1, z_2)$ where $z_2$ is $1 \times L_2$ and $L = L_1 + L_2$. Because the
model is overidentified, $L_2 > G_1$. Under the usual identification conditions we could
use any $1 \times G_1$ subset of $z_2$ as instruments for $y_2$ in estimating equation (6.30) (remember
the elements of $z_1$ act as their own instruments). Following his general
principle, Hausman (1978) suggested comparing the 2SLS estimator using all instru- 
ments to 2SLS using a subset that just identifies equation (6.30). If all instruments 
are valid, the estimates should differ only as a result of sampling error. As with testing 
for endogeneity, the Hausman (1978) test based on a quadratic form in the coeffi- 
cient differences can be cumbersome to compute. Fortunately, a regression-based test, 
originally due to Sargan (1958), is available. 

The Sargan test maintains homoskedasticity (Assumption 2SLS.3) under the null
hypothesis. It is easily obtained as $N R_u^2$ from the OLS regression

$\hat{u}_1$ on $z$, (6.31)

where $\hat{u}_1$ are the 2SLS residuals using all of the instruments $z$ and $R_u^2$ is the usual R-squared
(assuming that $z_1$ and $z$ contain a constant; otherwise it is the uncentered R-squared).
In other words, simply estimate regression (6.30) by 2SLS and obtain the
2SLS residuals, $\hat{u}_1$. Then regress these on all exogenous variables (including a constant).
Under the null that $E(z'u_1) = 0$ and Assumption 2SLS.3, $N R_u^2 \overset{a}{\sim} \chi^2_{Q_1}$, where
$Q_1 = L_2 - G_1$ is the number of overidentifying restrictions.
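
The steps of the Sargan test translate directly into code. What follows is a minimal sketch on simulated data: the helper names are hypothetical, and the `valid` flag that makes one excluded instrument endogenous is an illustrative device, not something from the text.

```python
import numpy as np

def sargan_statistic(X, Z, y):
    """Return (N * R^2, Q1), from regressing the 2SLS residuals on all of Z."""
    N = len(y)
    X_hat = Z @ np.linalg.lstsq(Z, X, rcond=None)[0]
    b_2sls = np.linalg.lstsq(X_hat, y, rcond=None)[0]
    u_hat = y - X @ b_2sls                                # 2SLS residuals use X, not X_hat
    resid = u_hat - Z @ np.linalg.lstsq(Z, u_hat, rcond=None)[0]
    sst = np.sum((u_hat - u_hat.mean()) ** 2)             # centered, since Z has a constant
    return N * (1.0 - resid @ resid / sst), Z.shape[1] - X.shape[1]

def simulate(valid, N=2000, seed=0):
    """Hypothetical design: one endogenous regressor and two excluded instruments."""
    rng = np.random.default_rng(seed)
    z1 = np.column_stack([np.ones(N), rng.normal(size=N)])
    z2 = rng.normal(size=(N, 2))
    Z = np.column_stack([z1, z2])
    v2 = rng.normal(size=N)
    u1 = 0.5 * v2 + rng.normal(size=N)
    if not valid:
        u1 = u1 + 0.5 * z2[:, 0]                          # first excluded instrument fails
    y2 = Z @ np.array([1.0, 0.5, 1.0, -1.0]) + v2
    y1 = z1 @ np.array([1.0, 2.0]) + 0.7 * y2 + u1
    return np.column_stack([z1, y2]), Z, y1

stat_valid, q1 = sargan_statistic(*simulate(valid=True))
stat_invalid, _ = sargan_statistic(*simulate(valid=False))
print(q1, stat_valid, stat_invalid)   # compare each statistic with chi-square(Q1)
```

The degrees of freedom are returned as the number of columns of `Z` minus the number of columns of `X`, which equals $L_2 - G_1$ because the elements of $z_1$ appear in both.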

The usefulness of the Sargan-Hausman test is that, if we reject the null hypothesis, 
then our logic for choosing the IVs must be reexamined. Unfortunately, the test does 
not tell us which IVs fail the exogeneity requirement; it could be one of them or all of 
them. (The symmetric way that all exogenous variables appear in regression (6.31)
makes it clear the test cannot single out faulty instruments.) If we fail to reject the 
null hypothesis, then we can have some confidence in the set of instruments used—up 
to a point. Even if we do not reject the null hypothesis, it is possible that more than 
one instrument is endogenous, and that the 2SLS estimators using a full and reduced 
set of instruments are asymptotically biased in similar ways. For example, suppose 
we have a single endogenous explanatory variable, years of schooling (educ), in a 
wage equation, and we propose two instruments, mother’s and father’s years of 
schooling, motheduc and fatheduc. The test of overidentifying restrictions is the same 
as comparing two IV estimates of the return to schooling: one uses motheduc as the 
only instrument for educ and the other uses fatheduc as the only instrument. We can 
easily think neither instrument is truly exogenous, and each is likely to be positively 
correlated with unobserved cognitive ability. Therefore, we might expect the two IV
estimates to give similar answers (and maybe even similar to OLS). But we should 
not take the similarity in the estimates to mean that the IVs are definitely exogenous; 
both could be leading us astray in the same direction with roughly the same bias 
magnitude. Even in cases where the point estimates are practically different, we might 
fail to reject exogenous instruments simply because the standard errors of the two IV 
estimates are large. 

It is straightforward to obtain a heteroskedasticity-robust test of the
overidentifying restrictions, but we need to separate the instrumental variables into two
groups. Let $z_2$ be the $1 \times L_2$ vector of exogenous variables excluded from equation
(6.24) and write $z_2 = (g_2, h_2)$, where $g_2$ is $1 \times G_1$ (the same dimension as $y_2$) and
$h_2$ is $1 \times Q_1$ (the number of overidentifying restrictions). It turns out not to matter
how we do this division, provided $h_2$ has $Q_1$ elements. Wooldridge (1995b) shows
that the following procedure is valid. As in equation (6.31), let $\hat{u}_1$ be the 2SLS
residuals from estimating equation (6.24), and let $\hat{y}_2$ denote the fitted values from the
first-stage regression, $y_2$ on $z$ (each element of $y_2$ onto $z$). Next, regress each element of $h_2$
onto $(z_1, \hat{y}_2)$ and save the residuals, say $\hat{r}_2$ ($1 \times Q_1$ for each observation). Then the
LM statistic is obtained as $N - SSR_0$, where $SSR_0$ is the sum of squared residuals
from the regression 1 on $\hat{u}_1 \hat{r}_2$. Under $H_0$, and without assuming homoskedasticity,
$N - SSR_0 \overset{a}{\sim} \chi^2_{Q_1}$. Alternatively, one may use a heteroskedasticity-robust Wald test
of $H_0: \eta_1 = 0$ in the auxiliary model


$\hat{u}_1 = \hat{r}_2 \eta_1 + error$. (6.32)


This approach differs from the LM statistic in that the residuals $\hat{e}_1 \equiv \hat{u}_1 - \hat{r}_2 \hat{\eta}_1$ are
used in the implicit variance matrix estimator, rather than $\hat{u}_1$. Under the null
hypothesis (and local alternatives), $\hat{\eta}_1 \overset{p}{\to} 0$, and so the statistics are asymptotically
equivalent.
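A minimal NumPy sketch of the robust LM procedure just described follows. The function name `robust_overid_stat`, its arguments, and the simulated design are assumptions for illustration only. The example also checks numerically that the choice of which $Q_1$ elements of $z_2$ go into $h_2$ does not change the statistic: the residuals from either choice span the same space.

```python
import numpy as np

def ols(y, X):
    """OLS coefficients from regressing y (vector or matrix) on the columns of X."""
    return np.linalg.lstsq(X, y, rcond=None)[0]

def robust_overid_stat(y1, z1, y2, z2, h_cols):
    """N - SSR0 from regressing 1 on u1_hat * r2_hat.

    z1: N x L1 included exogenous variables (with a constant column)
    y2: endogenous regressor(s); z2: N x L2 excluded instruments
    h_cols: indices of the Q1 = L2 - G1 columns of z2 used as h2
    """
    N = len(y1)
    Z = np.column_stack([z1, z2])
    y2_hat = Z @ ols(y2, Z)                      # first-stage fitted values
    X = np.column_stack([z1, y2])
    X_hat = np.column_stack([z1, y2_hat])
    u1 = y1 - X @ ols(y1, X_hat)                 # 2SLS residuals
    h2 = z2[:, h_cols]
    r2 = h2 - X_hat @ ols(h2, X_hat)             # residuals of h2 on (z1, y2_hat)
    W = u1[:, None] * r2                         # products u1_hat * r2_hat
    ones = np.ones(N)
    e = ones - W @ ols(ones, W)                  # regression of 1 on W, no constant
    return N - e @ e

# Simulated example (hypothetical design) with heteroskedastic structural error.
rng = np.random.default_rng(1)
N = 2000
w = rng.normal(size=N)
z1 = np.column_stack([np.ones(N), w])
z2 = rng.normal(size=(N, 3))                     # L2 = 3, G1 = 1, so Q1 = 2
c = rng.normal(size=N)
y2 = 0.5 * w + z2 @ np.array([1.0, 0.5, 0.5]) + c + rng.normal(size=N)
u1 = np.sqrt(0.5 + w ** 2) * (c + rng.normal(size=N))   # E(u1 | z) = 0 still holds
y1 = 1.0 + 0.5 * w + 1.0 * y2 + u1

stat_a = robust_overid_stat(y1, z1, y2, z2, [0, 1])
stat_b = robust_overid_stat(y1, z1, y2, z2, [1, 2])
print(round(stat_a, 3), round(stat_b, 3))        # same value for both divisions
```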


Example 6.3 (Overidentifying Restrictions in the Wage Equation): In estimating
equation (6.23) by 2SLS, we used (motheduc, fatheduc, huseduc) as instruments for
educ. Therefore, there are two overidentifying restrictions. Letting $\hat{u}_1$ be the 2SLS
residuals from equation (6.23) using all instruments, the test statistic is N times the
R-squared from the OLS regression


$\hat{u}_1$ on 1, exper, $exper^2$, motheduc, fatheduc, huseduc


Under $H_0$ and homoskedasticity, $N R_u^2 \overset{a}{\sim} \chi^2_2$. Using the data on working women in
MROZ.RAW gives $R_u^2 = .0026$, and so the overidentification test statistic is about
1.11. The p-value is about .574, so the overidentifying restrictions are not rejected at
any reasonable level.

For the heteroskedasticity-robust version, one approach is to obtain the residuals,
$\hat{r}_1$ and $\hat{r}_2$, from the OLS regressions motheduc on 1, exper, $exper^2$, and $\widehat{educ}$ and
fatheduc on 1, exper, $exper^2$, and $\widehat{educ}$, where $\widehat{educ}$ are the first-stage fitted values
from the regression educ on 1, exper, $exper^2$, motheduc, fatheduc, and huseduc. Then
obtain $N - SSR$ from the OLS regression 1 on $\hat{u}_1 \cdot \hat{r}_1$, $\hat{u}_1 \cdot \hat{r}_2$. Using only the 428
observations on working women to obtain $\hat{r}_1$ and $\hat{r}_2$, the value of the robust test
statistic is about 1.04 with p-value = .595, which is similar to the p-value for the
nonrobust test.


6.3.3 Testing Functional Form 


Sometimes we need a test with power for detecting neglected nonlinearities in models 
estimated by OLS or 2SLS. A useful approach is to add nonlinear functions, such as 
squares and cross products, to the original model. This approach is easy when all 
explanatory variables are exogenous: F statistics and LM statistics for exclusion 
restrictions are easily obtained. It is a little tricky for models with endogenous ex- 
planatory variables because we need to choose instruments for the additional non- 
linear functions of the endogenous variables. We postpone this topic until Chapter 9, 
when we discuss simultaneous equation models. See also Wooldridge (1995b). 

Putting in squares and cross products of all exogenous variables can consume 
many degrees of freedom. An alternative is Ramsey’s (1969) RESET, which has 
degrees of freedom that do not depend on K. Write the model as 


$y = x\beta + u$, (6.33)
$E(u \mid x) = 0$. (6.34)


(You should convince yourself that it makes no sense to test for functional form if we
only assume that $E(x'u) = 0$. If equation (6.33) defines a linear projection, then, by
definition, functional form is not an issue.) Under condition (6.34) we know that any
function of $x$ is uncorrelated with $u$ (hence the previous suggestion of putting squares
and cross products of $x$ as additional regressors). In particular, if condition (6.34)
holds, then $(x\beta)^p$ is uncorrelated with $u$ for any integer $p$. Since $\beta$ is not observed, we
replace it with the OLS estimator, $\hat{\beta}$. Define $\hat{y}_i = x_i\hat{\beta}$ as the OLS fitted values and $\hat{u}_i$
as the OLS residuals. By definition of OLS, the sample covariance between $\hat{u}_i$ and $\hat{y}_i$
is zero. But we can test whether the $\hat{u}_i$ are sufficiently correlated with low-order
polynomials in $\hat{y}_i$, say $\hat{y}_i^2$, $\hat{y}_i^3$, and $\hat{y}_i^4$, as a test for neglected nonlinearity. There are a
couple of ways to do so. Ramsey suggests adding these terms to equation (6.33) and
doing a standard F test (which would have an approximate $F_{3,N-K-3}$ distribution
under equation (6.33) and the homoskedasticity assumption $E(u^2 \mid x) = \sigma^2$). Another
possibility is to use an LM test: Regress $\hat{u}_i$ onto $x_i$, $\hat{y}_i^2$, $\hat{y}_i^3$, and $\hat{y}_i^4$ and use N times
the R-squared from this regression as $\chi^2_3$. The methods discussed in Chapter 4 for
obtaining heteroskedasticity-robust statistics can be applied here as well. Ramsey's
test uses generated regressors, but the null hypothesis is that each generated regressor
has zero population coefficient, and so the usual limit theory applies. (See Section
6.1.1.)
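The LM form of RESET can be sketched as follows. The function `reset_lm` and the simulated data are assumptions for illustration; the sketch uses only the square and cube of the fitted values (two degrees of freedom), as in the empirical example later in this section, and it is not the heteroskedasticity-robust version.

```python
import numpy as np

def ols(y, X):
    """OLS coefficients from regressing y on the columns of X."""
    return np.linalg.lstsq(X, y, rcond=None)[0]

def reset_lm(y, X, powers=(2, 3)):
    """LM form of RESET: N * R-squared from u_hat on X and powers of y_hat.

    X must contain a constant column, so the OLS residuals have mean zero
    and the centered and uncentered R-squareds coincide.
    """
    y_hat = X @ ols(y, X)
    u_hat = y - y_hat
    X_aug = np.column_stack([X] + [y_hat ** p for p in powers])
    e = u_hat - X_aug @ ols(u_hat, X_aug)
    return len(y) * (1.0 - (e @ e) / (u_hat @ u_hat))

# Simulated example (hypothetical design): one correctly specified linear model
# and one model with a neglected quadratic term in the first regressor.
rng = np.random.default_rng(2)
N = 1000
x = rng.normal(size=(N, 2))
X = np.column_stack([np.ones(N), x])
u = rng.normal(size=N)
y_linear = 1.0 + x[:, 0] + x[:, 1] + u
y_quadratic = y_linear + x[:, 0] ** 2

stat_linear = reset_lm(y_linear, X)        # chi-square(2) under the null
stat_quadratic = reset_lm(y_quadratic, X)  # large when E(y | x) is nonlinear
print(round(stat_linear, 2), round(stat_quadratic, 2))
```

The degrees of freedom equal the number of added powers of the fitted values, whatever the number of regressors in `X`.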

There is some misunderstanding in the testing literature about the merits of 
RESET. It has been claimed that RESET can be used to test for a multitude of 
specification problems, including omitted variables and heteroskedasticity. In fact, 
RESET is generally a poor test for either of these problems. It is easy to write down 
models where an omitted variable, say $q$, is highly correlated with each $x$, but RESET
has the same distribution that it has under $H_0$. A leading case is seen when $E(q \mid x)$ is
linear in $x$. Then $E(y \mid x)$ is linear in $x$ [even though $E(y \mid x) \neq E(y \mid x, q)$], and the
asymptotic power of RESET equals its asymptotic size. See Wooldridge (1995b) and 
Problem 6.4a. The following is an empirical illustration. 


Example 6.4 (Testing for Neglected Nonlinearities in a Wage Equation): We use 
OLS and the data in NLS80.RAW to estimate the equation from Example 4.3: 


$\log(wage) = \beta_0 + \beta_1 exper + \beta_2 tenure + \beta_3 married + \beta_4 south$
$\qquad + \beta_5 urban + \beta_6 black + \beta_7 educ + u.$


The null hypothesis is that the expected value of $u$ given the explanatory variables
in the equation is zero. The R-squared from the regression $\hat{u}$ on $x$, $\hat{y}^2$, and $\hat{y}^3$ yields
$R_u^2 = .0004$, so the chi-square statistic is .374 with p-value $\approx .83$. (Adding $\hat{y}^4$ only
increases the p-value.) Therefore, RESET provides no evidence of functional form
misspecification.

Even though we already know IQ shows up very significantly in the equation 
(t statistic = 3.60; see Example 4.3), RESET does not, and should not be expected to, 
detect the omitted variable problem. It can only test whether the expected value of y 
given the variables actually in the regression is linear in those variables. 


6.3.4 Testing for Heteroskedasticity 


As we have seen for both OLS and 2SLS, heteroskedasticity does not affect the con- 
sistency of the estimators, and it is only a minor nuisance for inference. Nevertheless, 
sometimes we want to test for the presence of heteroskedasticity in order to justify use 
of the usual OLS or 2SLS statistics. If heteroskedasticity is present, more efficient
estimation is possible. 

We begin with the case where the explanatory variables are exogenous in the sense 
that u has zero mean given x: 


$y = \beta_0 + x\beta + u$, $E(u \mid x) = 0$.


The reason we do not assume the weaker assumption E(x’u) = 0 is that the fol- 
lowing class of tests we derive—which encompasses all of the widely used tests for 
heteroskedasticity—are not valid unless E(u |x) = 0 is maintained under Ho. Thus, 
we maintain that the mean E(y |x) is correctly specified, and then we test the con- 
stant conditional variance assumption. If we do not assume correct specification of 
E(y|x), a significant heteroskedasticity test might just be detecting misspecified 
functional form in E(y |x); see Problem 6.4c. 

Because $E(u \mid x) = 0$, the null hypothesis can be stated as $H_0: E(u^2 \mid x) = \sigma^2$.
Under the alternative, $E(u^2 \mid x)$ depends on $x$ in some way. Thus, it makes sense to
test $H_0$ by looking at covariances


$Cov[h(x), u^2]$ (6.35)


for some $1 \times Q$ vector function $h(x)$. Under $H_0$, the covariance in expression (6.35)
should be zero for any choice of $h(\cdot)$.

Of course, a general way to test zero correlation is to use a regression. Putting $i$
subscripts on the variables, write the model


$u_i^2 = \delta_0 + h_i\delta + v_i$, (6.36)


where $h_i \equiv h(x_i)$; we make the standard rank assumption that $Var(h_i)$ has rank $Q$, so
that there is no perfect collinearity in $h_i$. Under $H_0$, $E(v_i \mid h_i) = E(v_i \mid x_i) = 0$, $\delta = 0$,
and $\delta_0 = \sigma^2$. Thus, we can apply an F test or an LM test for the null $H_0: \delta = 0$
in equation (6.36). One thing to notice is that $v_i$ cannot have a normal distribution
under $H_0$: because $v_i = u_i^2 - \sigma^2$, $v_i \geq -\sigma^2$. This does not matter for asymptotic
analysis; the OLS regression from equation (6.36) gives a consistent, $\sqrt{N}$-asymptotically
normal estimator of $\delta$ whether or not $H_0$ is true. But to apply a standard F or LM
test, we must assume that, under $H_0$, $E(v_i^2 \mid x_i)$ is constant; that is, the errors in
equation (6.36) are homoskedastic. In terms of the original error $u_i$, this assumption
implies that


$E(u_i^4 \mid x_i) = \text{constant} \equiv \kappa^2$ (6.37)


under $H_0$. This is called the homokurtosis (constant conditional fourth moment)
assumption. Homokurtosis always holds when $u$ is independent of $x$, but there are
conditional distributions for which $E(u \mid x) = 0$ and $Var(u \mid x) = \sigma^2$ but $E(u^4 \mid x)$
depends on $x$.

As a practical matter, we cannot test $\delta = 0$ in equation (6.36) directly because $u_i$ is
not observed. Since $u_i = y_i - x_i\beta$ and we have a consistent estimator of $\beta$, it is
natural to replace $u_i^2$ with $\hat{u}_i^2$, where the $\hat{u}_i$ are the OLS residuals for observation $i$. Doing
this step and applying, say, the LM principle, we obtain $N R_c^2$ from the regression


$\hat{u}_i^2$ on 1, $h_i$, $i = 1, 2, \ldots, N$, (6.38)


where $R_c^2$ is just the usual centered R-squared. Now, if the $u_i^2$ were used in place of
the $\hat{u}_i^2$, we know that, under $H_0$ and condition (6.37), $N R_c^2 \overset{a}{\sim} \chi^2_Q$, where $Q$ is the
dimension of $h_i$.

What adjustment is needed because we have estimated $u_i^2$? It turns out that,
because of the structure of these tests, no adjustment is needed to the asymptotics. (This
statement is not generally true for regressions where the dependent variable has been
estimated in a first stage; the current setup is special in that regard.) After tedious
algebra, it can be shown that


$N^{-1/2} \sum_{i=1}^{N} h_i'(\hat{u}_i^2 - \hat{\sigma}^2) = N^{-1/2} \sum_{i=1}^{N} (h_i - \mu_h)'(u_i^2 - \sigma^2) + o_p(1)$; (6.39)


see Problem 6.5. Along with condition (6.37), this equation can be shown to justify
the $N R_c^2$ test from regression (6.38).

Two popular tests are special cases. Koenker's (1981) version of the Breusch and
Pagan (1979) test is obtained by taking $h_i \equiv x_i$, so that $Q = K$. (The original version
of the Breusch-Pagan test relies heavily on normality of the $u_i$, in particular $\kappa^2 = 3\sigma^4$,
so that Koenker's version based on $N R_c^2$ in regression (6.38) is preferred.) White's
(1980b) test is obtained by taking $h_i$ to be all nonconstant, unique elements of $x_i$ and
$x_i'x_i$: the levels, squares, and cross products of the regressors in the conditional mean.

The Breusch-Pagan and White tests have degrees of freedom that depend on the
number of regressors in $E(y \mid x)$. Sometimes we want to conserve on degrees of
freedom. A test that combines features of the Breusch-Pagan and White tests but that has
only two dfs takes $\hat{h}_i \equiv (\hat{y}_i, \hat{y}_i^2)$, where the $\hat{y}_i$ are the OLS fitted values. (Recall that
these are linear functions of the $x_i$.) To justify this test, we must be able to replace
$h(x_i)$ with $h(x_i, \hat{\beta})$. We discussed the generated regressors problem for OLS in Section
6.1.1 and concluded that, for testing purposes, using estimates from earlier stages
causes no complications. This is the case here as well: $N R_c^2$ from $\hat{u}_i^2$ on 1, $\hat{y}_i$, $\hat{y}_i^2$,
$i = 1, 2, \ldots, N$ has a limiting $\chi^2_2$ distribution under the null hypothesis, along with
condition (6.37). This is easily seen to be a special case of the White test because
$(\hat{y}_i, \hat{y}_i^2)$ contains two linear combinations of the squares and cross products of all
elements in $x_i$.
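The three choices of $h_i$ discussed above differ only in which columns are placed on the right-hand side of regression (6.38), so one helper covers all of them. The sketch below uses simulated data; the names `het_lm` and `white_terms` and the error design are assumptions for illustration.

```python
import numpy as np
from itertools import combinations_with_replacement

def ols(y, X):
    """OLS coefficients from regressing y on the columns of X."""
    return np.linalg.lstsq(X, y, rcond=None)[0]

def het_lm(u_hat, H):
    """N * centered R-squared from regressing u_hat**2 on a constant and H."""
    N = len(u_hat)
    u2 = u_hat ** 2
    D = np.column_stack([np.ones(N), H])
    e = u2 - D @ ols(u2, D)
    return N * (1.0 - (e @ e) / np.sum((u2 - u2.mean()) ** 2))

def white_terms(x):
    """Levels, squares, and cross products of the nonconstant regressors x."""
    K = x.shape[1]
    pairs = combinations_with_replacement(range(K), 2)
    return np.column_stack([x] + [x[:, j] * x[:, k] for j, k in pairs])

# Simulated example (hypothetical design): Var(u | x) = (1 + 0.5 * x1)**2.
rng = np.random.default_rng(3)
N = 2000
x = rng.normal(size=(N, 2))
X = np.column_stack([np.ones(N), x])
u = (1.0 + 0.5 * x[:, 0]) * rng.normal(size=N)
y = 1.0 + x @ np.array([1.0, -1.0]) + u

y_hat = X @ ols(y, X)
u_hat = y - y_hat

bp = het_lm(u_hat, x)                                        # Koenker/Breusch-Pagan, Q = K = 2
white = het_lm(u_hat, white_terms(x))                        # White, Q = 5 here
two_df = het_lm(u_hat, np.column_stack([y_hat, y_hat ** 2])) # fitted-value version, Q = 2
print(round(bp, 1), round(white, 1), round(two_df, 1))
```

Because the White regressors nest the Breusch-Pagan regressors, the White statistic is never smaller than the Breusch-Pagan statistic in a given sample, but it is compared with a chi-square distribution that has more degrees of freedom.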

A simple modification is available for relaxing the auxiliary homokurtosis
assumption (6.37). Following the work of Wooldridge (1990), or working directly
from the representation in equation (6.39), as in Problem 6.5, it can be shown that
$N - SSR_0$ from the regression (without a constant)


1 on $(h_i - \bar{h})(\hat{u}_i^2 - \hat{\sigma}^2)$, $i = 1, 2, \ldots, N$ (6.40)


is distributed asymptotically as $\chi^2_Q$ under $H_0$ (there are $Q$ regressors in regression
(6.40)). This test is very similar to the heteroskedasticity-robust LM statistics derived
in Chapter 4. It is sometimes called a heterokurtosis-robust test for heteroskedasticity.
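Regression (6.40) is equally short to compute. In the sketch below, the function name `robust_het_stat` and the simulated design are assumptions for illustration, and $\hat{\sigma}^2$ is taken to be the sample average of the squared OLS residuals.

```python
import numpy as np

def ols(y, X):
    """OLS coefficients from regressing y on the columns of X."""
    return np.linalg.lstsq(X, y, rcond=None)[0]

def robust_het_stat(u_hat, H):
    """N - SSR0 from regressing 1 on (h_i - h_bar) * (u_hat_i**2 - sigma2_hat)."""
    N = len(u_hat)
    u2 = u_hat ** 2
    # Each row: demeaned h_i times the demeaned squared residual.
    W = (H - H.mean(axis=0)) * (u2 - u2.mean())[:, None]
    ones = np.ones(N)
    e = ones - W @ ols(ones, W)          # regression without a constant
    return N - e @ e

# Simulated example (hypothetical design): Var(u | x) = (1 + 0.5 * x1)**2.
rng = np.random.default_rng(4)
N = 5000
x = rng.normal(size=(N, 2))
X = np.column_stack([np.ones(N), x])
u = (1.0 + 0.5 * x[:, 0]) * rng.normal(size=N)
y = 1.0 + x @ np.array([1.0, -1.0]) + u
u_hat = y - X @ ols(y, X)

stat = robust_het_stat(u_hat, x)         # compare with chi-square(2) critical values
print(round(stat, 1))
```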

If we allow some elements of $x_i$ to be endogenous but assume we have instruments
$z_i$ such that $E(u_i \mid z_i) = 0$ and the rank condition holds, then we can test $H_0: E(u_i^2 \mid z_i) = \sigma^2$
(which implies Assumption 2SLS.3). Let $h_i \equiv h(z_i)$ be a $1 \times Q$ function of the
exogenous variables. The statistics are computed as in either regression (6.38) or
(6.40), depending on whether homokurtosis is maintained, where the $\hat{u}_i$ are the
2SLS residuals. There is, however, one caveat. For the validity of the asymptotic
variances that these regressions implicitly use, an additional assumption is needed
under $H_0$: $Cov(x_i, u_i \mid z_i)$ must be constant. This covariance is zero when $z_i = x_i$,
so there is no additional assumption when the regressors are exogenous. Without
the assumption of constant conditional covariance, the tests for heteroskedasticity
are more complicated. For details, see Wooldridge (1990). Baum, Schaffer, and
Stillman (2003) review tests that do not require the constant conditional covariance
assumption.

You should remember that $h_i$ (or $\hat{h}_i$) must only be a function of exogenous
variables and estimated parameters; it should not depend on endogenous elements of $x_i$.
Therefore, when $x_i$ contains endogenous variables, it is not valid to use $x_i\hat{\beta}$ and
$(x_i\hat{\beta})^2$ as elements of $\hat{h}_i$. It is valid to use, say, $\hat{x}_i\hat{\beta}$ and $(\hat{x}_i\hat{\beta})^2$, where the $\hat{x}_i$ are the
first-stage fitted values from regressing $x_i$ on $z_i$.


6.4 Correlated Random Coefficient Models 


In Section 4.3.3, we discussed models where unobserved heterogeneity interacts with 
one or more explanatory variables and mentioned how they can be interpreted as 
"random coefficient" models. Recall that if the heterogeneity is independent of the
covariates, then the usual OLS estimator that ignores the heterogeneity consistently
estimates the average partial effect (APE). Or, if we have suitable proxy variables for
the unobserved heterogeneity, we can simply add interactions between the covariates 
and the (demeaned) proxy variables to estimate the APEs. 

Consistent estimation of APEs is more difficult if one or more explanatory vari- 
ables are endogenous (and there are no proxy variables that break the correlation 
between the unobservable variables and the endogenous variables). In this section, we 
provide a relatively simple analysis that is suitable for continuous, or roughly con- 
tinuous, endogenous explanatory variables. We will continue our treatment of such 
models in Part IV, after we know more about estimation of models for discrete 
response. 


6.4.1 When Is the Usual IV Estimator Consistent? 


We study a special case of Wooldridge (2003b), who allows for multiple endogenous
explanatory variables. A slight modification of equation (6.9) is


$y_1 = \eta_1 + z_1\delta_1 + a_1 y_2 + u_1$, (6.41)


where $z_1$ is $1 \times L_1$, $y_2$ is the endogenous explanatory variable, and $a_1$ is the
"coefficient" on $y_2$, an unobserved random variable. (The reason we now set apart the
intercept in (6.41) will be clear shortly.) We could replace $\delta_1$ with a random vector,
say $d_1$, without substantively changing the following analysis, because we would
assume $E(d_1 \mid z) = E(d_1) = \delta_1$ for the exogenous variables $z$. The additional part of the
error term, $z_1(d_1 - \delta_1)$, has a zero mean conditional on $z$, and so its presence would
not affect our approach to estimation. The interesting feature of equation (6.41) is
that the random coefficient, $a_1$, might be correlated with $y_2$. Following Heckman and
Vytlacil (1998), we refer to (6.41) as a correlated random coefficient (CRC) model.

It is convenient to write $a_1 = \alpha_1 + v_1$, where $\alpha_1 = E(a_1)$ is the object of interest. We
can rewrite the equation as


$y_1 = \eta_1 + z_1\delta_1 + \alpha_1 y_2 + v_1 y_2 + u_1 \equiv \eta_1 + z_1\delta_1 + \alpha_1 y_2 + e_1$, (6.42)


where $e_1 = v_1 y_2 + u_1$. Equation (6.42) shows explicitly a constant coefficient on $y_2$
(which we hope to estimate) but also an interaction between the unobserved
heterogeneity, $v_1$, and $y_2$. Remember, equation (6.42) is a population model. For a random
draw, we would write $y_{i1} = \eta_1 + z_{i1}\delta_1 + \alpha_1 y_{i2} + v_{i1} y_{i2} + u_{i1}$, which makes it clear that
$\delta_1$ and $\alpha_1$ are parameters to estimate and $v_{i1}$ is specific to observation $i$.

As discussed in Wooldridge (1997b, 2003b), the potential problem with applying 
instrumental variables (2SLS) to (6.42) is that the error term $v_1 y_2 + u_1$ is not
necessarily uncorrelated with the instruments $z$, even if we make the assumptions


$E(u_1 \mid z) = E(v_1 \mid z) = 0$, (6.43)


which we maintain from here on. Generally, the term $v_1 y_2$ can cause problems for
IV estimation, but it is important to be clear about the nature of the problem. If we
are allowing $y_2$ to be correlated with $u_1$, then we also want to allow $y_2$ and $v_1$ to be
correlated. In other words, $E(v_1 y_2) = Cov(v_1, y_2) \equiv \tau_1 \neq 0$. But a nonzero
unconditional covariance is not a problem with applying IV to equation (6.42); it simply
implies that the composite error term, $e_1$, has (unconditional) mean $\tau_1$ rather than
zero. As we know, a nonzero mean for $e_1$ means that the original intercept, $\eta_1$, would
be inconsistently estimated, but this is rarely a concern.

Therefore, we can allow $Cov(v_1, y_2)$, the unconditional covariance, to be
unrestricted. But the usual IV estimator is generally inconsistent if $E(v_1 y_2 \mid z)$ depends
on $z$. (There are still cases, which we will cover in Part IV, where the IV estimator is
consistent.) Note that, because $E(v_1 \mid z) = 0$, $E(v_1 y_2 \mid z) = Cov(v_1, y_2 \mid z)$. Therefore,
as shown in Wooldridge (2003b), a sufficient condition for the IV estimator applied
to equation (6.42) to be consistent for $\delta_1$ and $\alpha_1$ is


$Cov(v_1, y_2 \mid z) = Cov(v_1, y_2)$. (6.44)


The 2SLS intercept estimator is consistent for $\eta_1 + \tau_1$. Condition (6.44) means that
the conditional covariance between $v_1$ and $y_2$ is not a function of $z$, but the
unconditional covariance is unrestricted.

Because $v_1$ is unobserved, we cannot generally verify condition (6.44). But it is easy
to find situations where it holds. For example, if we write


$y_2 = m_2(z) + v_2$ (6.45)


and assume $(v_1, v_2)$ is independent of $z$ (with zero mean), then condition (6.44)
is easily seen to hold because $Cov(v_1, y_2 \mid z) = Cov(v_1, v_2 \mid z)$, and the latter cannot
be a function of $z$ under independence. Of course, assuming $v_2$ in equation (6.45) is
independent of $z$ is a strong assumption even if we do not need to specify the mean
function, $m_2(z)$. It is much stronger than just writing down a linear projection of $y_2$
on $z$ (which is no real assumption at all). As we will see in various models in Part IV,
the representation (6.45) with $v_2$ independent of $z$ is not suitable for discrete $y_2$,
and generally (6.44) is not a good assumption when $y_2$ has discrete characteristics.
Further, as discussed in Card (2001), condition (6.44) can be violated even if $y_2$ is
(roughly) continuous. Wooldridge (2005c) makes some headway in relaxing
condition (6.44), but such methods are beyond the scope of this chapter.
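A small simulation can make the consistency result concrete. The design below is an assumption chosen for illustration: there are no included exogenous variables, a single instrument $z$, and $(v_1, v_2)$ is drawn independently of $z$, so condition (6.44) holds with $\tau_1 = Cov(v_1, v_2) = 0.5$. Under this design the IV slope should be close to $\alpha_1$, the IV intercept close to $\eta_1 + \tau_1$, and the OLS slope should not be close to $\alpha_1$.

```python
import numpy as np

rng = np.random.default_rng(0)
N = 200_000
eta1, alpha1 = 1.0, 2.0                    # hypothetical population values

z = rng.normal(size=N)                     # instrument
v2 = rng.normal(size=N)                    # reduced-form error, independent of z
v1 = 0.5 * v2 + rng.normal(size=N)         # heterogeneity in the coefficient on y2
u1 = 0.5 * v2 + rng.normal(size=N)         # additive error, correlated with y2
y2 = z + v2                                # equation (6.45) with m2(z) = z
y1 = eta1 + (alpha1 + v1) * y2 + u1        # CRC model (6.41) without z1

def simple_iv(y, x, inst):
    """Just-identified IV with one regressor, one instrument, and an intercept."""
    slope = np.cov(inst, y)[0, 1] / np.cov(inst, x)[0, 1]
    return y.mean() - slope * x.mean(), slope

def simple_ols(y, x):
    """OLS with one regressor and an intercept."""
    slope = np.cov(x, y)[0, 1] / np.var(x, ddof=1)
    return y.mean() - slope * x.mean(), slope

iv_intercept, iv_slope = simple_iv(y1, y2, z)
ols_intercept, ols_slope = simple_ols(y1, y2)
print(round(iv_intercept, 2), round(iv_slope, 2), round(ols_slope, 2))
```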

A useful extension of equation (6.41) is to allow observed exogenous variables to
interact with $y_2$. The most convenient formulation is


$y_1 = \eta_1 + z_1\delta_1 + \alpha_1 y_2 + (z_1 - \psi_1) y_2 \gamma_1 + v_1 y_2 + u_1$, (6.46)


where $\psi_1 \equiv E(z_1)$ is the $1 \times L_1$ vector of population means of the exogenous
variables and $\gamma_1$ is an $L_1 \times 1$ parameter vector. As we saw in Chapter 4, subtracting the
mean from $z_1$ before forming the interaction with $y_2$ ensures that $\alpha_1$ is the average
partial effect.

Estimation of equation (6.46) is simple if we maintain condition (6.44) (along with
(6.43) and the appropriate rank condition). Typically, we would replace the unknown
$\psi_1$ with the sample averages, $\bar{z}_1$, and then estimate


$y_{i1} = \theta_1 + z_{i1}\delta_1 + \alpha_1 y_{i2} + (z_{i1} - \bar{z}_1) y_{i2} \gamma_1 + error_i$ (6.47)


by instrumental variables, ignoring the estimation error in the population mean
(see Problem 6.10 for justification). The only issue is choice of instruments, which is
complicated by the interaction term. One possibility is to use interactions between
$z_{i1}$ and all elements of $z_i$ (including $z_{i1}$). This results in many overidentifying
restrictions, even if we just have one instrument $z_{i2}$ for $y_{i2}$. Alternatively, we could obtain
fitted values from a first-stage linear regression $y_{i2}$ on $z_i$, $\hat{y}_{i2} = z_i\hat{\pi}_2$, and then use
IVs $[1, z_i, (z_{i1} - \bar{z}_1)\hat{y}_{i2}]$, which results in as many overidentifying restrictions as for
the model without the interaction. Note that the use of $(z_{i1} - \bar{z}_1)\hat{y}_{i2}$ as IVs for
$(z_{i1} - \bar{z}_1) y_{i2}$ is asymptotically the same as using instruments $(z_{i1} - \psi_1) \cdot (z_i\pi_2)$, where
$L(y_2 \mid z) = z\pi_2$ is the linear projection. In other words, consistency of this IV
procedure does not in any way restrict the nature of the distribution of $y_2$ given $z$. Plus,
although we have generated instruments, the assumptions sufficient for ignoring
estimation of the instruments hold, and so inference is standard (perhaps made robust
to heteroskedasticity, as usual). In Chapter 8 we will develop the tools that allow us
to determine when this choice of instruments produces the asymptotically efficient IV
estimator.

We can just identify the parameters in equation (6.46) by using a further restricted
set of instruments, $[1, z_{i1}, \hat{y}_{i2}, (z_{i1} - \bar{z}_1)\hat{y}_{i2}]$. If so, it is important to use these as
instruments and not as regressors. The latter procedure is suggested by Heckman and
Vytlacil (1998), which I will refer to as HV (1998), under the assumptions (6.43) and
(6.44), along with


$E(y_2 \mid z) = z\pi_2$ (6.48)


(where $z$ includes a constant). Under assumptions (6.43), (6.44), and (6.48), it is easy
to show that


$E(y_1 \mid z) = (\eta_1 + \tau_1) + z_1\delta_1 + \alpha_1(z\pi_2) + (z_1 - \psi_1) \cdot (z\pi_2)\gamma_1$. (6.49)


HV (1998) use this expression to suggest a two-step regression procedure. In the first
step, $y_{i2}$ is regressed on $z_i$ to obtain the fitted values, $\hat{y}_{i2}$, as before. In the second
step, $\eta_1 + \tau_1$, $\delta_1$, $\alpha_1$, and $\gamma_1$ are estimated from the OLS regression $y_{i1}$ on 1, $z_{i1}$, $\hat{y}_{i2}$,
and $(z_{i1} - \bar{z}_1)\hat{y}_{i2}$. Generally, consistency of this procedure hinges on assumption
(6.48), and it is important to see that running this regression is not the same as
applying IV to equation (6.47) with instruments $[1, z_{i1}, \hat{y}_{i2}, (z_{i1} - \bar{z}_1)\hat{y}_{i2}]$. (The
first-stage regressions using the IV approach are, in effect, linear projections of $y_2$ on
$[1, z_1, z\pi_2, (z_1 - \psi_1) \cdot (z\pi_2)]$ and of $(z_1 - \psi_1)y_2$ on $[1, z_1, z\pi_2, (z_1 - \psi_1) \cdot (z\pi_2)]$; no
restrictions are made on $E(y_2 \mid z)$.) In practice, the IV and two-step regression
approaches may give similar estimates, but, even if this is the case, the IV standard
errors need not be adjusted, whereas the second-step OLS standard errors do,
because of the generated regressors problem.

In summary, applying standard IV methods to equation (6.47) provides a consis- 
tent estimator of the APE when condition (6.44) holds; no further restrictions on the 
distribution of y2 given z are needed. 


6.4.2 Control Function Approach 


Garen (1984) studies the model in equation (6.41) (and also allows y2 to appear as 
a quadratic and interacted with exogenous variables, but that does not change the 
control function approach). He proposed a control function approach to estimate the 
parameters. It is instructive to derive the control function approach here and to 
contrast it to the IV approaches discussed above. 

Like Heckman and Vytlacil (1998), Garen uses a particular model for $E(y_2 \mid z)$. In
fact, Garen makes the assumption $y_2 = z\pi_2 + v_2$, where $(u_1, v_1, v_2)$ is independent of
$z$ with a mean-zero trivariate normal distribution. The normality and independence
assumptions are much stronger than needed. We can get by with


$E(u_1 \mid z, v_2) = \rho_1 v_2$, $E(v_1 \mid z, v_2) = \xi_1 v_2$, (6.50)


which is the same as equation (6.18). From equation (6.41),


$E(y_1 \mid z, y_2) = \eta_1 + z_1\delta_1 + \alpha_1 y_2 + E(v_1 \mid z, y_2) y_2 + E(u_1 \mid z, y_2)$


$\qquad = \eta_1 + z_1\delta_1 + \alpha_1 y_2 + \xi_1 v_2 y_2 + \rho_1 v_2$, (6.51)


and this equation is estimable once we estimate $\pi_2$. So, Garen's (1984) control
function procedure is first to regress $y_2$ on $z$ and obtain the reduced-form residuals, $\hat{v}_2$,
and then to run the OLS regression $y_1$ on 1, $z_1$, $y_2$, $\hat{v}_2 y_2$, $\hat{v}_2$. Under the maintained
assumptions, Garen's method consistently estimates $\delta_1$ and $\alpha_1$. Because the second step
uses generated regressors, the standard errors should be adjusted for the estimation of
$\pi_2$ in the first stage. Nevertheless, a test that $y_2$ is exogenous is easily obtained from
the usual F test of $H_0: \xi_1 = 0$, $\rho_1 = 0$ (or a heteroskedasticity-robust version). Under
the null hypothesis, no adjustment of the standard errors is needed for the generated regressors.
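The two steps of the control function procedure can be sketched on simulated data. Everything about the design (the parameter values, one included exogenous variable `z1`, one excluded instrument `z2`) is an assumption for illustration, constructed so that condition (6.50) holds. The sketch reports point estimates only; it does not compute the adjusted standard errors mentioned above.

```python
import numpy as np

def ols(y, X):
    """OLS coefficients from regressing y on the columns of X."""
    return np.linalg.lstsq(X, y, rcond=None)[0]

rng = np.random.default_rng(5)
N = 200_000
eta1, delta1, alpha1 = 1.0, 0.5, 2.0       # hypothetical structural parameters
rho1, xi1 = 0.5, 0.5                       # slopes in condition (6.50)

z1 = rng.normal(size=N)                    # included exogenous variable
z2 = rng.normal(size=N)                    # excluded instrument
v2 = rng.normal(size=N)
v1 = xi1 * v2 + 0.5 * rng.normal(size=N)   # E(v1 | z, v2) = xi1 * v2
u1 = rho1 * v2 + rng.normal(size=N)        # E(u1 | z, v2) = rho1 * v2
y2 = 0.5 * z1 + z2 + v2                    # linear reduced form for y2
y1 = eta1 + delta1 * z1 + (alpha1 + v1) * y2 + u1

# Step 1: reduced-form residuals from regressing y2 on all exogenous variables.
Z = np.column_stack([np.ones(N), z1, z2])
v2_hat = y2 - Z @ ols(y2, Z)

# Step 2: OLS of y1 on 1, z1, y2, v2_hat * y2, v2_hat, as in equation (6.51).
R = np.column_stack([np.ones(N), z1, y2, v2_hat * y2, v2_hat])
b = ols(y1, R)
print(np.round(b, 3))                      # order: eta1, delta1, alpha1, xi1, rho1
```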

Garen's assumptions are more restrictive than those needed for the standard IV
estimator to be consistent. For one, it would be a fluke if assumption (6.50) held
without the conditional covariance $Cov(v_1, y_2 \mid z)$ being independent of $z$. Plus, like
HV (1998), Garen relies on a linear model for $E(y_2 \mid z)$. Further, Garen adds the
assumptions that $E(u_1 \mid v_2)$ and $E(v_1 \mid v_2)$ are linear functions, something not needed
for the IV approach.

If the assumptions needed for Garen’s CF estimator to be consistent hold, it is 
likely more efficient than the IV estimator, although a comparison of the correct 
asymptotic variances is complicated. As we discussed in Section 6.2, when IV and 
control function methods lead to different estimators, the CF estimator is likely to be 
more efficient but less robust. 


6.5 Pooled Cross Sections and Difference-in-Differences Estimation 


So far our treatment of OLS and 2SLS has been explicitly for the case of random 
samples. In this section we briefly discuss how random samples from different points 
in time can be exploited, particularly for policy analysis. 


6.5.1 Pooled Cross Sections over Time 


A data structure that is useful for a variety of purposes, including policy analysis, is 
what we will call pooled cross sections over time. The idea is that during each year a 
new random sample is taken from the relevant population. Since distributions of 
variables tend to change over time, the identical distribution assumption is not usu- 
ally valid, but the independence assumption is. Sampling a changing population at 
different points in time gives rise to independent, not identically distributed (i.n.i.d.) 
observations. It is important not to confuse a pooling of independent cross sections 
with a different data structure, panel data, which we treat starting in Chapter 7. 
Briefly, in a panel data set we follow the same group of individuals, firms, cities, and 
so on over time. In a pooling of cross sections over time, there is no replicability over 
time. (Or, if units appear in more than one time period, their recurrence is treated as 
coincidental and ignored.) 

Every method we have learned for pure cross section analysis can be applied to 
pooled cross sections, including corrections for heteroskedasticity, specification test- 
ing, instrumental variables, and so on. But in using pooled cross sections, we should 
usually include year (or other time period) dummies to account for aggregate changes
over time. If year dummies appear in a model, and it is estimated by 2SLS, the year 
dummies are their own instruments, as the passage of time is exogenous. For an ex- 
ample, see Problem 6.8. Time dummies can also appear in tests for heteroskedasticity 
to determine whether the unconditional error variance has changed over time. 

In some cases we interact some explanatory variables with the time dummies to 
allow partial effects to change over time. For example, in estimating a wage equation 
using data sampled during different years, we might want to allow the return to 
schooling or union membership to change across time. Or we might want to deter- 
mine how the gender gap in wages has changed over time. This is easily accomplished 
by interacting the appropriate variable with a full set of year dummies (less one if we 
explicitly use the first year as the base year, as is common). Typically, one would in- 
clude a full set of time period dummies by themselves to allow for secular changes, 
including inflation and changes in productivity that are not captured by observed 
covariates. Problems 6.8 and 6.11 ask you to work through some empirical examples, 
focusing on how the results should be interpreted. 


6.5.2 Policy Analysis and Difference-in-Differences Estimation 


Much of the recent literature in policy analysis using natural experiments can be cast 
as regression with pooled cross sections with appropriately chosen interactions. In the 
simplest case, we have two time periods, say year 1 and year 2. There are also two
groups, which we will call a control group and an experimental group or treatment 
group. In the natural experiment literature, people (or firms, or cities, and so on) find 
themselves in the treatment group essentially by accident. For example, to study the 
effects of an unexpected change in unemployment insurance on unemployment 
duration, we choose the treatment group to be unemployed individuals from a state 
that has a change in unemployment compensation. The control group could be un- 
employed workers from a neighboring state. The two time periods chosen would 
straddle the policy change. 

As another example, the treatment group might consist of houses in a city under- 
going unexpected property tax reform, and the control group would be houses in a 
nearby, similar town that is not subject to a property tax change. Again, the two (or 
more) years of data would include the period of the policy change. Treatment means 
that a house is in the city undergoing the regime change. 

To formalize the discussion, call A the control group, and let B denote the treat- 
ment group; the dummy variable dB equals unity for those in the treatment group 
and is zero otherwise. Letting d2 denote a dummy variable for the second
(post-policy-change) time period, the simplest equation for analyzing the impact
of the policy change is

$$y = \beta_0 + \beta_1\, dB + \delta_0\, d2 + \delta_1\, d2 \cdot dB + u, \tag{6.52}$$


where y is the outcome of interest. The dummy variable dB captures possible differ- 
ences between the treatment and control groups prior to the policy change. The time 
period dummy, d2, captures aggregate factors that would cause changes in y even in 
the absence of a policy change. The coefficient of interest, $\delta_1$, multiplies the
interaction term, $d2 \cdot dB$, which is the same as a dummy variable equal to one for
those observations in the treatment group in the second period.
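
To see why $\delta_1$ is the parameter of interest, it can help to write out the four
group-by-period means implied by equation (6.52). The sketch below assumes
$E(u \mid dB, d2) = 0$, an assumption added here purely for illustration:

```latex
\begin{aligned}
E(y \mid dB = 0,\ d2 = 0) &= \beta_0, \\
E(y \mid dB = 1,\ d2 = 0) &= \beta_0 + \beta_1, \\
E(y \mid dB = 0,\ d2 = 1) &= \beta_0 + \delta_0, \\
E(y \mid dB = 1,\ d2 = 1) &= \beta_0 + \beta_1 + \delta_0 + \delta_1.
\end{aligned}
```

Under that assumption, the change over time is $\delta_0 + \delta_1$ for the treatment
group and $\delta_0$ for the control group, so the difference between the two changes
is $\delta_1$.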

The OLS estimator, $\hat{\delta}_1$, has a very interesting interpretation. Let $\bar{y}_{A,1}$ denote the
sample average of y for the control group in the first year, and let $\bar{y}_{A,2}$ be the average
of y for the control group in the second year. Define $\bar{y}_{B,1}$ and $\bar{y}_{B,2}$ similarly. Then $\hat{\delta}_1$
can be expressed as

$$\hat{\delta}_1 = (\bar{y}_{B,2} - \bar{y}_{B,1}) - (\bar{y}_{A,2} - \bar{y}_{A,1}). \tag{6.53}$$


This estimator has been labeled the difference-in-differences (DD) estimator in the 
recent program evaluation literature, although it has a long history in analysis of 
variance. 
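
The equivalence between the OLS coefficient on the interaction term and the
difference in mean changes can be checked numerically. The following is a
minimal sketch on simulated data, not part of the original text: the sample
size, the parameter values, and the helper name `cell_mean` are all hypothetical
choices made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 4000

# Hypothetical pooled cross section: dB = treatment-group dummy,
# d2 = second-period dummy, drawn independently for each observation.
dB = rng.integers(0, 2, size=n)
d2 = rng.integers(0, 2, size=n)
u = rng.normal(size=n)

# Assumed population parameters (illustrative only).
beta0, beta1, delta0, delta1 = 1.0, 0.5, 0.3, 2.0
y = beta0 + beta1 * dB + delta0 * d2 + delta1 * d2 * dB + u

# OLS of y on (1, dB, d2, d2*dB), as in equation (6.52).
X = np.column_stack([np.ones(n), dB, d2, d2 * dB])
coef, *_ = np.linalg.lstsq(X, y, rcond=None)
delta1_ols = coef[3]


def cell_mean(group, period):
    """Sample average of y for a given group (0 = A, 1 = B) and period flag."""
    return y[(dB == group) & (d2 == period)].mean()


# Difference-in-differences of the four cell means, as in equation (6.53).
delta1_dd = (cell_mean(1, 1) - cell_mean(1, 0)) - (cell_mean(0, 1) - cell_mean(0, 0))

print(delta1_ols, delta1_dd)
```

Because the regression is saturated in the two dummies, the two numbers agree
up to floating-point error, whatever the simulated parameter values are.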

To see how effective $\hat{\delta}_1$ is for estimating policy effects, we can compare it with some
alternative estimators. One possibility is to ignore the control group completely and
use the change in the mean over time for the treatment group, $\bar{y}_{B,2} - \bar{y}_{B,1}$, to measure
the policy effect. The problem with this estimator is that the mean response can
change over time for reasons unrelated to the policy change. Another possibility is to
ignore the first time period and compute the difference in means for the treatment
and control groups in the second time period, $\bar{y}_{B,2} - \bar{y}_{A,2}$. The problem with this pure
cross section approach is that there might be systematic, unmeasured differences in 
the treatment and control groups that have nothing to do with the treatment; attrib- 
uting the difference in averages to a particular policy might be misleading. 

By comparing the time changes in the means for the treatment and control groups, 
both group-specific and time-specific effects are allowed for. Nevertheless, unbiased- 
ness of the DD estimator still requires that the policy change not be systematically 
related to other factors that affect y (and are hidden in u). 

In most applications, additional covariates appear in equation (6.52), for example, 
characteristics of unemployed people or housing characteristics. These account for 
the possibility that the random samples within a group have systematically
different characteristics in the two time periods. The OLS estimator of $\delta_1$ no longer
has the simple representation in equation (6.53), but its interpretation is essentially
unchanged.

Example 6.5 (Length of Time on Workers’ Compensation): Meyer, Viscusi, and
Durbin (1995) (hereafter, MVD) study the length of time (in weeks) that an injured 
worker receives workers’ compensation. On July 15, 1980, Kentucky raised the cap 
on weekly earnings that were covered by workers’ compensation. An increase in the 
cap has no effect on the benefit for low-income workers but makes it less costly for a 
high-income worker to stay on workers’ comp. Therefore, the control group is low- 
income workers and the treatment group is high-income workers; high-income 
workers are defined as those for whom the pre-policy-change cap on benefits is 
binding. Using random samples both before and after the policy change, MVD are 
able to test whether more generous workers’ compensation causes people to stay out 
of work longer (everything else fixed). MVD start with a difference-in-differences 
analysis, using log(durat) as the dependent variable. The variable afchnge is the 
dummy variable for observations after the policy change, and highearn is the dummy 
variable for high earners. The estimated equation is

$$\widehat{\log(durat)} = \underset{(0.031)}{1.126} + \underset{(.0447)}{.0077}\, afchnge + \underset{(.047)}{.256}\, highearn + \underset{(.069)}{.191}\, afchnge \cdot highearn \tag{6.54}$$

$$N = 5{,}626, \qquad R^2 = .021$$


Therefore, $\hat{\delta}_1 = .191$ (t = 2.77), which implies that the average duration on workers’
compensation increased by about 19 percent owing to the higher earnings cap. The
coefficient on afchnge is small and statistically insignificant: as is expected, the
increase in the earnings cap had no effect on duration for low-earnings workers. The
coefficient on highearn shows that, even in the absence of any change in the earnings
cap, high earners spent much more time on workers’ compensation, on the order of
$100 \cdot [\exp(.256) - 1] = 29.2$ percent.
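
The arithmetic behind these statements can be reproduced directly from the
reported estimates in equation (6.54). The snippet below is only a sketch; the
variable names are hypothetical, and the exact percentage for the interaction
term is computed here for comparison with the 19 percent log approximation
quoted in the text.

```python
import math

# Reported estimates from equation (6.54).
coef_interaction = 0.191   # coefficient on afchnge * highearn
se_interaction = 0.069     # its standard error
coef_highearn = 0.256      # coefficient on highearn

# t statistic for the interaction coefficient.
t_stat = coef_interaction / se_interaction

# Log approximation versus exact percentage change, 100 * [exp(b) - 1].
approx_pct_interaction = 100 * coef_interaction
exact_pct_interaction = 100 * (math.exp(coef_interaction) - 1)
exact_pct_highearn = 100 * (math.exp(coef_highearn) - 1)

print(round(t_stat, 2))
print(round(approx_pct_interaction, 1), round(exact_pct_interaction, 1))
print(round(exact_pct_highearn, 1))
```

The gap between the approximate and exact figures for the interaction term
shows how the log approximation drifts as the coefficient grows.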

MVD also add a variety of controls for gender, marital status, age, industry, and 
type of injury. These allow for the fact that the kind of people and type of injuries 
differ systematically in the two years. Perhaps not surprisingly, controlling for these
factors has little effect on the estimate of $\delta_1$; see the MVD article and Problem 6.9.


Sometimes the two groups consist of people or cities in different states in the 
United States, often close geographically. For example, to assess the impact of 
changing alcohol taxes on alcohol consumption, we can obtain random samples on 
individuals from two states for two years. In state A, the control group, there was 
no change in alcohol taxes. In state B, taxes increased between the two years. The
outcome variable would be a measure of alcohol consumption, and equation (6.52) can
be estimated to determine the effect of the tax on alcohol consumption. Other factors, 
such as age, education, and gender can be controlled for, although this procedure is 
not necessary for consistency if sampling is random in both years and in both states. 

The basic equation (6.52) can be easily modified to allow for continuous, or at least 
nonbinary, “treatments.” An example is given in Problem 6.7, where the “treatment” 
for a particular home is its distance from a garbage incinerator site. In other words, 
there is not really a control group: each unit is put somewhere on a continuum of 
possible treatments. The analysis is similar because the treatment dummy, dB, is 
simply replaced with the nonbinary treatment. 

In some cases a more convincing analysis of a policy change is available by further 
refining the definition of treatment and control groups. For example, suppose a state 
implements a change in health care policy aimed at the elderly, say people 65 and 
older, and the response variable, y, is a health outcome. One possibility is to use data 
only on people in the state with the policy change, both before and after the change, 
with the control group being people under 65 and the treatment group being people 
65 and older. This DD strategy is similar to the MVD (1995) application. The po- 
tential problem with this DD analysis is that other factors unrelated to the state’s new 
policy might affect the health of the elderly relative to the younger population, such 
as changes in health care emphasis at the federal level. A different DD analysis would
be to use the elderly from another state, one without the policy change, as the
control group. Here the problem is that changes in the health of the
elderly might be systematically different across states as a result of, say, income and 
wealth differences rather than the policy change. 

A more robust analysis than either of the DD analyses described above can be 
obtained by using both a different state and a control group within the treatment 
state. If we again label the two time periods as 1 and 2, let B represent the state 
implementing the policy, and let E denote the group of elderly, then an expanded 
version of equation (6.52) is 


$$y = \beta_0 + \beta_1\, dB + \beta_2\, dE + \beta_3\, dB \cdot dE + \delta_0\, d2 + \delta_1\, d2 \cdot dB + \delta_2\, d2 \cdot dE + \delta_3\, d2 \cdot dB \cdot dE + u. \tag{6.55}$$


The coefficient of interest is now $\delta_3$, the coefficient on the triple interaction term,
$d2 \cdot dB \cdot dE$. The OLS estimate $\hat{\delta}_3$ can be expressed as follows:

$$\hat{\delta}_3 = (\bar{y}_{B,E,2} - \bar{y}_{B,E,1}) - (\bar{y}_{A,E,2} - \bar{y}_{A,E,1}) - (\bar{y}_{B,N,2} - \bar{y}_{B,N,1}), \tag{6.56}$$

where the A subscript means the state not implementing the policy and the N
subscript represents the nonelderly. For obvious reasons, the estimator in equation (6.56)
is called the difference-in-difference-in-differences (DDD) estimate. (The population 
analogue of equation (6.56) is easily established from equation (6.55) by finding the 
expected values of the six groups appearing in equation (6.56).) If we drop either the 
middle term or the last term, we obtain one of the DD estimates described in 
the previous paragraph. The DDD estimate starts with the time change in averages 
for the elderly in the treatment state and then nets out the change in means for elderly 
in the control state and the change in means for the nonelderly in the treatment state. 
The hope is that this controls for two kinds of potentially confounding trends: 
changes in health status of elderly across states (which would have nothing to do with 
the policy) and changes in health status of all people living in the policy-change state 
(possibly due to other state policies that affect everyone’s health, or state-specific 
changes in the economy that affect everyone’s health). When implemented as a
regression, a standard error for $\hat{\delta}_3$ is easily obtained, including a
heteroskedasticity-robust standard error. As in the DD case, it is straightforward
to add additional covariates to equation (6.55).

The DD and DDD methodologies can be applied to more than two time periods. 
In the first case, a full set of time period dummies is added to equation (6.52), and a
policy dummy replaces $d2 \cdot dB$; the policy dummy is simply defined to be unity for groups
and time periods subject to the policy. This imposes the restriction that the policy has 
the same effect in every year, an assumption that is easily relaxed. In a DDD analy- 
sis, a full set of dummies is included for each of the two groups and all time periods, 
as well as for all pairwise interactions. Then, a policy dummy (or sometimes a con- 
tinuous policy variable) measures the effect of the policy. See Gruber (1994) for an 
application to mandated maternity benefits. 
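
A minimal simulated sketch of the multi-period DD regression just described,
with two groups, a full set of time dummies (less one), and a single policy
dummy. The number of periods, the period in which the policy starts, and every
parameter value below are hypothetical choices, not taken from the text.

```python
import numpy as np

rng = np.random.default_rng(1)
n, n_periods = 5000, 5

# Hypothetical pooled cross sections: group dummy and period index per observation.
dB = rng.integers(0, 2, size=n)
period = rng.integers(0, n_periods, size=n)

# Policy dummy: unity for group B in periods 2, 3, 4 (policy assumed to start in period 2).
policy = ((dB == 1) & (period >= 2)).astype(float)

# Assumed aggregate time effects and a constant policy effect across years.
time_effects = np.array([0.0, 0.2, -0.1, 0.4, 0.3])
true_effect = 1.5
y = 1.0 + 0.5 * dB + time_effects[period] + true_effect * policy + rng.normal(size=n)

# Regressors: intercept, group dummy, time dummies for periods 1..4, policy dummy.
time_dummies = (period[:, None] == np.arange(1, n_periods)).astype(float)
X = np.column_stack([np.ones(n), dB, time_dummies, policy])

coef, *_ = np.linalg.lstsq(X, y, rcond=None)
policy_effect_hat = coef[-1]
print(policy_effect_hat)
```

Relaxing the same-effect-in-every-year restriction would amount to replacing the
single policy column with separate interactions between dB and each post-policy
time dummy.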

Sometimes the treatment and control groups involve multiple geographical or po- 
litical units, such as states in the United States. For example, Carpenter (2004) con- 
siders the effects of state-level zero tolerance alcohol laws for people under age 21 on 
various drinking behaviors (the outcome variable, y). In each year, a state is defined 
as a zero tolerance state or not (say, zerotol). Carpenter uses young adults aged 21 to 24
years as an additional control group in a DDD-type analysis. In Carpenter’s
regressions, a full set of state dummies, year dummies, and monthly dummies are included
(the latter to control for seasonal variations in drinking behavior). Carpenter then
includes the zerotol dummy and the interaction $zerotol \cdot under21$, where under21 is a
dummy variable for the 18 to 20 age group. The coefficient on the latter measures the
effect of zero tolerance laws. This is in the spirit of a DDD analysis, although
Carpenter does not appear to include the pairwise interactions suggested by a full DDD
approach.


Problems


6.1. a. In Problem 5.4d, test the null hypothesis that educ is exogenous. 


b. Test the single overidentifying restriction in this example.


6.2. In Problem 5.8b, test the null hypothesis that educ and IQ are exogenous in the
equation estimated by 2SLS. 


6.3. Consider a model for individual data to test whether nutrition affects produc- 
tivity (in a developing country): 


$$\log(produc) = \delta_0 + \delta_1\, exper + \delta_2\, exper^2 + \delta_3\, educ + \alpha_1\, calories + \alpha_2\, protein + u_1, \tag{6.57}$$

where produc is some measure of worker productivity, calories is caloric intake per
day, and protein is a measure of protein intake per day. Assume here that exper,
$exper^2$, and educ are all exogenous. The variables calories and protein are possibly
correlated with $u_1$ (see Strauss and Thomas (1995) for discussion). Possible
instrumental variables for calories and protein are regional prices of various goods,
such as grains, meats, breads, dairy products, and so on.


a. Under what circumstances do prices make good IVs for calories and protein?
What if prices reflect quality of food? 
b. How many prices are needed to identify equation (6.57)? 


c. Suppose we have M prices, $p_1, \ldots, p_M$. Explain how to test the null hypothesis
that calories and protein are exogenous in equation (6.57). 

6.4. Consider a structural linear model with unobserved variable q:

$$y = x\beta + q + v, \qquad E(v \mid x, q) = 0.$$

Suppose, in addition, that $E(q \mid x) = x\delta$ for some K × 1 vector $\delta$; thus, q and x are
possibly correlated.

a. Show that E(y |x) is linear in x. What consequences does this fact have for tests of 
functional form to detect the presence of q? Does it matter how strongly q and x are 
correlated? Explain. 

b. Now add the assumptions $\mathrm{Var}(v \mid x, q) = \sigma_v^2$ and $\mathrm{Var}(q \mid x) = \sigma_q^2$. Show that
$\mathrm{Var}(y \mid x)$ is constant. (Hint: $E(qv \mid x) = 0$ by iterated expectations.) What does this
fact imply about using tests for heteroskedasticity to detect omitted variables?

c. Now write the equation as $y = x\beta + u$, where $E(x'u) = 0$ and $\mathrm{Var}(u \mid x) = \sigma^2$. If
$E(u \mid x) \neq E(u)$, argue that an LM test of the form (6.28) will detect
“heteroskedasticity” in u, at least in large samples.


6.5. a. Verify equation (6.39) under the assumptions $E(u \mid x) = 0$ and $E(u^2 \mid x) = \sigma^2$.


b. Show that, under the additional assumption (6.27),

$$E[(u_i^2 - \sigma^2)^2 (h_i - \mu_h)'(h_i - \mu_h)] = \eta^2 E[(h_i - \mu_h)'(h_i - \mu_h)],$$

where $\eta^2 = E[(u^2 - \sigma^2)^2]$.

c. Explain why parts a and b imply that the LM statistic from regression (6.38) has a
limiting $\chi^2_Q$ distribution.

d. If condition (6.37) does not hold, obtain a consistent estimator of
$E[(u_i^2 - \sigma^2)^2 (h_i - \mu_h)'(h_i - \mu_h)]$. Show how this leads to the heterokurtosis-robust
test for heteroskedasticity.


6.6. Using the test for heteroskedasticity based on the auxiliary regression $\hat{u}^2$ on $\hat{y}$,
$\hat{y}^2$, test the log(wage) equation in Example 6.4 for heteroskedasticity. Do you detect
heteroskedasticity at the 5 percent level? 


6.7. For this problem use the data in HPRICE.RAW, which is a subset of the 
data used by Kiel and McClain (1995). The file contains housing prices and charac- 
teristics for two years, 1978 and 1981, for homes sold in North Andover, Massachu- 
setts. In 1981, construction on a garbage incinerator began. Rumors about the 
incinerator being built were circulating in 1979, and it is for this reason that 1978 is 
used as the base year. By 1981 it was very clear that the incinerator would be oper- 
ating soon. 


a. Using the 1981 cross section, estimate a bivariate, constant elasticity model relat- 
ing housing price to distance from the incinerator. Is this regression appropriate for 
determining the causal effects of the incinerator on housing prices? Explain.


b. Pooling the two years of data, consider the model

$$\log(price) = \delta_0 + \delta_1\, y81 + \delta_2 \log(dist) + \delta_3\, y81 \cdot \log(dist) + u.$$

If the incinerator has a negative effect on housing prices for homes closer to the
incinerator, what sign is $\delta_3$? Estimate this model and test the null hypothesis that
building the incinerator had no effect on housing prices.


c. Add the variables log(intst), $[\log(intst)]^2$, log(area), log(land), age, $age^2$, rooms,
baths to the model in part b, and test for an incinerator effect. What do you conclude?


6.8. The data in FERTIL1.RAW are a pooled cross section on more than a thousand
U.S. women for the even years between 1972 and 1984, inclusive; the data set is
similar to the one used by Sander (1992). These data can be used to study the rela- 
tionship between women’s education and fertility. 


a. Use OLS to estimate a model relating number of children ever born to a woman 
(kids) to years of education, age, region, race, and type of environment reared in. 
You should use a quadratic in age and should include year dummies. What is the 
estimated relationship between fertility and education? Holding other factors fixed, 
has there been any notable secular change in fertility over the time period? 


b. Reestimate the model in part a, but use motheduc and fatheduc as instruments for 
educ. First check that these instruments are sufficiently partially correlated with educ. 
Test whether educ is in fact exogenous in the fertility equation. 


c. Now allow the effect of education to change over time by including interaction 
terms such as $y74 \cdot educ$, $y76 \cdot educ$, and so on in the model. Use interactions of time
dummies and parents’ education as instruments for the interaction terms. Test that 
there has been no change in the relationship between fertility and education over 
time. 


6.9. Use the data in INJURY.RAW for this question. 


a. Using the data for Kentucky, reestimate equation (6.54) adding as explanatory 
variables male, married, and a full set of industry- and injury-type dummy variables. 
How does the estimate on $afchnge \cdot highearn$ change when these other factors are
controlled for? Is the estimate still statistically significant? 


b. What do you make of the small R-squared from part a? Does this mean the 
equation is useless? 


c. Estimate equation (6.54) using the data for Michigan. Compare the estimate on the 
interaction term for Michigan and Kentucky, as well as their statistical significance. 


6.10. Consider a regression model with interactions and squares of some
explanatory variables: $E(y \mid x) = z\beta$, where z contains a constant, the elements of x, and
quadratics and interactions of terms in x.


a. Let $\mu = E(x)$ be the population mean of x, and let $\bar{x}$ be the sample average based
on the N available observations. Let $\hat{\beta}$ be the OLS estimator of $\beta$ using the N
observations on y and z. Show that $\sqrt{N}(\hat{\beta} - \beta)$ and $\sqrt{N}(\bar{x} - \mu)$ are asymptotically
uncorrelated. (Hint: Write $\sqrt{N}(\hat{\beta} - \beta)$ as in equation (4.8), and ignore the $o_p(1)$ term.
You will need to use the fact that $E(u \mid x) = 0$.)

b. In the model of Problem 4.8, use part a to argue that

$$\mathrm{Avar}(\hat{\alpha}_1) = \mathrm{Avar}(\tilde{\alpha}_1) + \beta_3^2\, \mathrm{Avar}(\bar{x}_2) = \mathrm{Avar}(\tilde{\alpha}_1) + \beta_3^2 (\sigma_2^2 / N),$$

where $\alpha_1 = \beta_1 + \beta_3 \mu_2$, $\tilde{\alpha}_1$ is the estimator of $\alpha_1$ if we knew $\mu_2$, and $\sigma_2^2 = \mathrm{Var}(x_2)$.


c. How would you obtain the correct asymptotic standard error of $\hat{\alpha}_1$, having run the
regression in Problem 4.8d? (Hint: The standard error you get from the regression is
really $\mathrm{se}(\tilde{\alpha}_1)$. Thus you can square this to estimate $\mathrm{Avar}(\tilde{\alpha}_1)$, then use the preceding
formula. You need to estimate $\sigma_2^2$, too.)


d. Apply the result from part c to the model in Problem 4.8; in particular, find the
corrected asymptotic standard error for $\hat{\alpha}_1$, and compare it with the uncorrected one
from Problem 4.8d. (Both can be nonrobust to heteroskedasticity.) What do you 
conclude? 


6.11. The following wage equation represents the populations of working people in 
1978 and 1985: 


$$\log(wage) = \beta_0 + \delta_0\, y85 + \beta_1\, educ + \delta_1\, y85 \cdot educ + \beta_2\, exper + \beta_3\, exper^2 + \beta_4\, union + \beta_5\, female + \delta_5\, y85 \cdot female + u,$$


where the explanatory variables are standard. The variable union is a dummy vari- 
able equal to one if the person belongs to a union and zero otherwise. The variable 
y85 is a dummy variable equal to one if the observation comes from 1985 and zero if 
it comes from 1978. In the file CPS78_85.RAW, there are 550 workers in the sample 
in 1978 and a different set of 534 people in 1985. 


a. Estimate this equation and test whether the return to education has changed over 
the seven-year period. 


b. What has happened to the gender gap over the period? 


c. Wages are measured in nominal dollars. What coefficients would change if we 
measure wage in 1978 dollars in both years? (Hint: Use the fact that for all 1985
observations, $\log(wage_i / P85) = \log(wage_i) - \log(P85)$, where P85 is the common
deflator; P85 = 1.65 according to the Consumer Price Index.)

d. Is there evidence that the variance of the error has changed over time? 

e. With wages measured nominally, and holding other factors fixed, what is the 
estimated increase in nominal wage for a male with 12 years of education? Propose a 
regression to obtain a confidence interval for this estimate. (Hint: You must replace
$y85 \cdot educ$ with something else.)


6.12. In the linear model $y = x\beta + u$, initially assume that Assumptions 2SLS.1
through 2SLS.3 hold with w in place of z, where w includes all nonredundant elements
of x and z.


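a. Show that

$$\mathrm{Avar}[\sqrt{N}(\hat{\beta}_{2SLS} - \hat{\beta}_{OLS})] = \mathrm{Avar}[\sqrt{N}(\hat{\beta}_{2SLS} - \beta)] - \mathrm{Avar}[\sqrt{N}(\hat{\beta}_{OLS} - \beta)] = \sigma^2[E(x^{*\prime}x^*)]^{-1} - \sigma^2[E(x'x)]^{-1},$$

where $x^* = z\Pi$ is the linear projection of x on z. (Hint: It will help to write

$$\sqrt{N}(\hat{\beta}_{2SLS} - \beta) = A_1^{-1}\left(N^{-1/2}\sum_{i=1}^{N} x_i^{*\prime}u_i\right) + o_p(1)$$

and

$$\sqrt{N}(\hat{\beta}_{OLS} - \beta) = A_2^{-1}\left(N^{-1/2}\sum_{i=1}^{N} x_i'u_i\right) + o_p(1),$$

where $A_1 = E(x^{*\prime}x^*)$ and $A_2 = E(x'x)$. Then use these two to obtain the joint
asymptotic distribution of $\sqrt{N}(\hat{\beta}_{2SLS} - \beta)$ and $\sqrt{N}(\hat{\beta}_{OLS} - \beta)$ under $H_0$. Generally,
$\mathrm{Avar}[\sqrt{N}(\hat{\beta}_{2SLS} - \hat{\beta}_{OLS})] = V_1 + V_2 - (C + C')$, where $V_1 = \mathrm{Avar}[\sqrt{N}(\hat{\beta}_{2SLS} - \beta)]$,
$V_2 = \mathrm{Avar}[\sqrt{N}(\hat{\beta}_{OLS} - \beta)]$, and C is the asymptotic covariance. Under the given
assumptions, you can show $C = V_2$.)

b. Show how to estimate $\mathrm{Avar}[\sqrt{N}(\hat{\beta}_{2SLS} - \hat{\beta}_{OLS})]$ if Assumption 2SLS.3 (and
Assumption OLS.3) do not hold under $H_0$.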


6.13. Referring to equations (6.17) and (6.18), show that if $E(u_1 \mid z) = 0$ and
$E(u_1 \mid z, v_2) = \rho_1 v_2$, then $E(v_2 \mid z) = 0$.


6.14. Let $y_1$ and $y_2$ be scalars, and suppose the structural model is

$$y_1 = z_1\delta_1 + g(y_2)\alpha_1 + u_1, \qquad E(u_1 \mid z) = 0,$$

where $g(y_2)$ is a 1 × $G_1$ vector of functions of $y_2$ and z contains at least one element
not in $z_1$. For example, $g(y_2) = (y_2, y_2^2)$ allows for a quadratic. Or $g(y_2)$ might be a
vector of dummy variables indicating different intervals that $y_2$ falls into.

Assume that $y_2$ has a linear conditional expectation, written as

$$y_2 = z\pi_2 + v_2, \qquad E(v_2 \mid z) = 0.$$


(Remember, this is much stronger than simply writing down a linear projection.)
Further, assume $(u_1, v_2)$ is independent of z. (This pretty much rules out $y_2$ with
discrete characteristics.)


a. Show that

$$E(y_1 \mid z, y_2) = E(y_1 \mid z, v_2) = z_1\delta_1 + g(y_2)\alpha_1 + E(u_1 \mid v_2).$$

b. Now add the assumption $E(u_1 \mid v_2) = \rho_1 v_2$. Propose a consistent control function
estimator of $\delta_1$, $\alpha_1$, and $\rho_1$.

c. How would you test the null hypothesis that y2 is exogenous? Be very specific. 


d. How would you modify the CF approach if $E(u_1 \mid v_2) = \rho_1 v_2 + \xi_1 (v_2^2 - \tau_2^2)$, where
$\tau_2^2 = E(v_2^2)$? How would you test the null hypothesis that $y_2$ is exogenous?


e. Would your general CF approach change if we replace g(y2) with g(z1, y2)? 
Explain. 


f. Suggest a more robust method of estimating $\delta_1$ and $\alpha_1$. In particular, suppose you
are only willing to assume $E(u_1 \mid z) = 0$.

6.15. Expand the model from Problem 6.14 to

$$y_1 = z_1\delta_1 + g(z_1, y_2)\alpha_1 + g(z_1, y_2)v_1 + u_1,$$

$$E(u_1 \mid z) = 0, \qquad E(v_1 \mid z) = 0,$$

where $g(z_1, y_2)$ is a 1 × $J_1$ vector of functions, $\alpha_1$ is a $J_1$ × 1 parameter vector, and $v_1$
is a $J_1$ × 1 vector of unobserved heterogeneity. Make the same assumptions on the
reduced form of $y_2$. Further, assume $E(u_1 \mid z, v_2) = \rho_1 v_2$ and $E(v_1 \mid z, v_2) = \theta_1 v_2$ for a
$J_1$ × 1 vector $\theta_1$.


a. Find $E(y_1 \mid z, y_2)$.
b. Propose a control function method for consistently estimating $\delta_1$ and $\alpha_1$.
c. How would you test the null hypothesis that $y_2$ is exogenous?


d. Explain in detail how to apply the CF method to Garen’s (1984) model, where
only $y_2$ is interacted with unobserved heterogeneity:
$y_1 = \eta_1 + z_1\delta_1 + \alpha_{11} y_2 + \alpha_{12} y_2^2 + z_1 y_2 \gamma_1 + v_1 y_2 + u_1$.


Appendix 6A 


We derive the asymptotic distribution of the 2SLS estimator in an equation with 
generated regressors and generated instruments. The tools needed to make the proof 
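rigorous are introduced in Chapter 12, but the key components of the proof can be
given here in the context of the linear model. Write the model as

$$y = x\beta + u, \qquad E(u \mid v) = 0,$$

where $x = f(w, \delta)$, $\delta$ is a Q × 1 vector, and f is K × 1. Let $\hat{\delta}$ be a $\sqrt{N}$-consistent
estimator of $\delta$. The instruments for each i are $\hat{z}_i = g(v_i, \hat{\lambda})$, where $g(v, \lambda)$ is a 1 × L
vector, $\lambda$ is an S × 1 vector of parameters, and $\hat{\lambda}$ is $\sqrt{N}$-consistent for $\lambda$. Let $\hat{\beta}$ be the
2SLS estimator from the equation

$$y_i = \hat{x}_i\beta + error_i,$$

where $\hat{x}_i = f(w_i, \hat{\delta})$, using instruments $\hat{z}_i$:

$$\hat{\beta} = \left[\left(\sum_{i=1}^{N} \hat{x}_i'\hat{z}_i\right)\left(\sum_{i=1}^{N} \hat{z}_i'\hat{z}_i\right)^{-1}\left(\sum_{i=1}^{N} \hat{z}_i'\hat{x}_i\right)\right]^{-1}\left(\sum_{i=1}^{N} \hat{x}_i'\hat{z}_i\right)\left(\sum_{i=1}^{N} \hat{z}_i'\hat{z}_i\right)^{-1}\left(\sum_{i=1}^{N} \hat{z}_i'y_i\right).$$

Write $y_i = \hat{x}_i\beta + (x_i - \hat{x}_i)\beta + u_i$, where $x_i = f(w_i, \delta)$. Plugging this in and
multiplying through by $\sqrt{N}$ gives

$$\sqrt{N}(\hat{\beta} - \beta) = (\hat{C}'\hat{D}^{-1}\hat{C})^{-1}\hat{C}'\hat{D}^{-1}\left\{N^{-1/2}\sum_{i=1}^{N} \hat{z}_i'[(x_i - \hat{x}_i)\beta + u_i]\right\},$$

where

$$\hat{C} \equiv N^{-1}\sum_{i=1}^{N} \hat{z}_i'\hat{x}_i \qquad \text{and} \qquad \hat{D} = N^{-1}\sum_{i=1}^{N} \hat{z}_i'\hat{z}_i.$$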


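Now, using Lemma 12.1 in Chapter 12, $\hat{C} \xrightarrow{p} E(z'x)$ and $\hat{D} \xrightarrow{p} E(z'z)$. Further, a
mean value expansion of the kind used in Theorem 12.3 gives

$$N^{-1/2}\sum_{i=1}^{N} \hat{z}_i'u_i = N^{-1/2}\sum_{i=1}^{N} z_i'u_i + \left[N^{-1}\sum_{i=1}^{N} \nabla_\lambda g(v_i, \lambda)u_i\right]\sqrt{N}(\hat{\lambda} - \lambda) + o_p(1),$$

where $\nabla_\lambda g(v_i, \lambda)$ is the L × S Jacobian of $g(v_i, \lambda)'$. Because $E(u_i \mid v_i) = 0$,
$E[\nabla_\lambda g(v_i, \lambda)u_i] = 0$. It follows that $N^{-1}\sum_{i=1}^{N} \nabla_\lambda g(v_i, \lambda)u_i = o_p(1)$ and, since
$\sqrt{N}(\hat{\lambda} - \lambda) = O_p(1)$, it follows that

$$N^{-1/2}\sum_{i=1}^{N} \hat{z}_i'u_i = N^{-1/2}\sum_{i=1}^{N} z_i'u_i + o_p(1).$$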

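Next, using similar reasoning,

$$N^{-1/2}\sum_{i=1}^{N} \hat{z}_i'(x_i - \hat{x}_i)\beta = -\left[N^{-1}\sum_{i=1}^{N} (\beta \otimes z_i)'\nabla_\delta f(w_i, \delta)\right]\sqrt{N}(\hat{\delta} - \delta) + o_p(1) = -G\sqrt{N}(\hat{\delta} - \delta) + o_p(1),$$

where $G = E[(\beta \otimes z_i)'\nabla_\delta f(w_i, \delta)]$ and $\nabla_\delta f(w_i, \delta)$ is the K × Q Jacobian of $f(w_i, \delta)'$.
We have used a mean value expansion and $\hat{z}_i'(x_i - \hat{x}_i)\beta = (\beta \otimes \hat{z}_i)'(x_i - \hat{x}_i)'$. Now,
assume that

$$\sqrt{N}(\hat{\delta} - \delta) = N^{-1/2}\sum_{i=1}^{N} r_i(\delta) + o_p(1),$$

where $E[r_i(\delta)] = 0$. This assumption holds for all estimators discussed so far, and it
also holds for most estimators in nonlinear models; see Chapter 12. Collecting all
terms gives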


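$$\sqrt{N}(\hat{\beta} - \beta) = (C'D^{-1}C)^{-1}C'D^{-1}\left\{N^{-1/2}\sum_{i=1}^{N} [z_i'u_i - Gr_i(\delta)]\right\} + o_p(1).$$

By the central limit theorem,

$$\sqrt{N}(\hat{\beta} - \beta) \overset{a}{\sim} \mathrm{Normal}[0, (C'D^{-1}C)^{-1}C'D^{-1}MD^{-1}C(C'D^{-1}C)^{-1}],$$

where

$$M = \mathrm{Var}[z_i'u_i - Gr_i(\delta)].$$

The asymptotic variance of $\hat{\beta}$ is estimated as

$$(\hat{C}'\hat{D}^{-1}\hat{C})^{-1}\hat{C}'\hat{D}^{-1}\hat{M}\hat{D}^{-1}\hat{C}(\hat{C}'\hat{D}^{-1}\hat{C})^{-1}/N, \tag{6.58}$$

where

$$\hat{M} = N^{-1}\sum_{i=1}^{N} (\hat{z}_i'\hat{u}_i - \hat{G}\hat{r}_i)(\hat{z}_i'\hat{u}_i - \hat{G}\hat{r}_i)', \tag{6.59}$$

$$\hat{G} = N^{-1}\sum_{i=1}^{N} (\hat{\beta} \otimes \hat{z}_i)'\nabla_\delta f(w_i, \hat{\delta}), \tag{6.60}$$

and

$$\hat{r}_i = r_i(\hat{\delta}), \qquad \hat{u}_i = y_i - \hat{x}_i\hat{\beta}. \tag{6.61}$$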




A few comments are in order. First, estimation of $\lambda$ does not affect the asymptotic
distribution of $\hat{\beta}$. Therefore, if there are no generated regressors, the usual 2SLS
inference procedures are valid (G = 0 in this case and so $M = E(u_i^2 z_i'z_i)$). If G = 0 and
$E(u^2 z'z) = \sigma^2 E(z'z)$, then the usual 2SLS standard errors and test statistics are valid.
If Assumption 2SLS.3 fails, then the heteroskedasticity-robust statistics are valid.
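
For the case G = 0, the sandwich form in (6.58) with $\hat{M} = N^{-1}\sum \hat{u}_i^2 \hat{z}_i'\hat{z}_i$ can be
computed in a few lines. The following is a rough sketch on simulated data with
no generated regressors or instruments; the data-generating process, sample size,
and variable names are hypothetical and chosen only so that the regressor is
endogenous and the error is heteroskedastic.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 5000

# Hypothetical instruments and errors: v makes x endogenous, and the
# second error component is heteroskedastic in z1 (so 2SLS.3 fails).
z1 = rng.normal(size=n)
z2 = rng.normal(size=n)
v = rng.normal(size=n)
u = 0.5 * v + rng.normal(size=n) * (1.0 + 0.5 * np.abs(z1))
x_endog = 0.8 * z1 + 0.5 * z2 + v

X = np.column_stack([np.ones(n), x_endog])   # N x K regressors
Z = np.column_stack([np.ones(n), z1, z2])    # N x L instruments
beta_true = np.array([1.0, 2.0])
y = X @ beta_true + u

# Sample analogues of C = E(z'x) and D = E(z'z).
C_hat = Z.T @ X / n
D_hat = Z.T @ Z / n
D_inv = np.linalg.inv(D_hat)

# A = (C'D^{-1}C)^{-1} C'D^{-1}; the 2SLS estimator is A times N^{-1} sum z_i'y_i.
A = np.linalg.inv(C_hat.T @ D_inv @ C_hat) @ C_hat.T @ D_inv
beta_hat = A @ (Z.T @ y / n)

# With G = 0, M_hat = N^{-1} sum u_hat_i^2 z_i'z_i, and (6.58) becomes A M_hat A' / N.
u_hat = y - X @ beta_hat
M_hat = (Z * u_hat[:, None] ** 2).T @ Z / n
avar_hat = A @ M_hat @ A.T / n
robust_se = np.sqrt(np.diag(avar_hat))

print(beta_hat, robust_se)
```

When there are generated regressors and G is nonzero, the same code would need
the extra $\hat{G}\hat{r}_i$ term from (6.59), which depends on the first-stage estimator and
is not shown here.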

If G ≠ 0, then the asymptotic variance of $\hat{\beta}$ depends on that of $\hat{\delta}$ (through
the presence of $r_i(\delta)$). Neither the usual 2SLS variance matrix estimator nor the
heteroskedasticity-robust form is valid in this case. The matrix $\hat{M}$ should be
computed as in equation (6.59).

In some cases, G = 0 under the null hypothesis that we wish to test. The jth row of
G can be written as $E[z_{ij}\beta'\nabla_\delta f(w_i, \delta)]$. Now, suppose that $\hat{x}_{ih}$ is the only generated
regressor, so that only the hth row of $\nabla_\delta f(w_i, \delta)$ is nonzero. But then if $\beta_h = 0$,
$\beta'\nabla_\delta f(w_i, \delta) = 0$. It follows that G = 0 and $M = E(u_i^2 z_i'z_i)$, so that no adjustment for
the preliminary estimation of $\delta$ is needed. This observation is very useful for a variety
of specification tests, including the test for endogeneity in Section 6.3.1. We will also 
use it in sample selection contexts later on. 


7 Estimating Systems of Equations by Ordinary Least Squares 
and Generalized Least Squares 


7.1 Introduction 


This chapter begins our analysis of linear systems of equations. The first method of 
estimation we cover is system ordinary least squares, which is a direct extension of 
OLS for single equations. In some important special cases the system OLS estimator 
turns out to have a straightforward interpretation in terms of single-equation OLS 
estimators. But the method is applicable to very general linear systems of equations. 

We then turn to a generalized least squares (GLS) analysis. Under certain as- 
sumptions, GLS—or its operationalized version, feasible GLS—will turn out to be 
asymptotically more efficient than system OLS. Nevertheless, we emphasize in this 
chapter that the efficiency of GLS comes at a price: it requires stronger assumptions 
than system OLS in order to be consistent. This is a practically important point that 
is often overlooked in traditional treatments of linear systems, particularly those that 
assume the explanatory variables are nonrandom. 

As with our single-equation analysis, we assume that a random sample is available 
from the population. Usually the unit of observation is obvious—such as a worker, a 
household, a firm, or a city. For example, if we collect consumption data on various 
commodities for a cross section of families, the unit of observation is the family (not 
a commodity). 

The framework of this chapter is general enough to apply to panel data models. 
Because the asymptotic analysis is done as the cross section dimension tends to in- 
finity, the results are explicitly for the case where the cross section dimension is large 
relative to the time series dimension. (For example, we may have observations on N
firms over the same T time periods for each firm. Then, we assume we have a random 
sample of firms that have data in each of the T years.) The panel data model covered 
here, while having many useful applications, does not fully exploit the replicability 
over time. In Chapters 10 and 11 we explicitly consider panel data models that con- 
tain time-invariant, unobserved effects in the error term. 


7.2 Some Examples 


We begin with two examples of systems of equations. These examples are fairly gen- 
eral, and we will see later that variants of them can also be cast as a general linear 
system of equations. 


Example 7.1 (Seemingly Unrelated Regressions): The population model is a set of 
G linear equations,

$$\begin{aligned}
y_1 &= x_1\beta_1 + u_1 \\
y_2 &= x_2\beta_2 + u_2 \\
&\;\;\vdots \\
y_G &= x_G\beta_G + u_G,
\end{aligned} \tag{7.1}$$

where $x_g$ is 1 × $K_g$ and $\beta_g$ is $K_g$ × 1, g = 1, 2, ..., G. In many applications $x_g$ is the
same for all g (in which case the $\beta_g$ necessarily have the same dimension), but the
general model allows the elements and the dimension of $x_g$ to vary across equations.
Remember, the system (7.1) represents a generic person, firm, city, or whatever from
the population. The system (7.1) is often called Zellner’s (1962) seemingly unrelated
regressions (SUR) model (for cross section data in this case). The name comes from
the fact that, since each equation in the system (7.1) has its own vector $\beta_g$, it appears
that the equations are unrelated. Nevertheless, correlation across the errors in
different equations can provide links that can be exploited in estimation; we will see this
point later.

As a specific example, the system (7.1) might represent a set of demand functions 
for the population of families in a country: 


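$$\begin{aligned}
housing &= \beta_{10} + \beta_{11}\, houseprc + \beta_{12}\, foodprc + \beta_{13}\, clothprc + \beta_{14}\, income + \beta_{15}\, size + \beta_{16}\, age + u_1, \\
food &= \beta_{20} + \beta_{21}\, houseprc + \beta_{22}\, foodprc + \beta_{23}\, clothprc + \beta_{24}\, income + \beta_{25}\, size + \beta_{26}\, age + u_2, \\
clothing &= \beta_{30} + \beta_{31}\, houseprc + \beta_{32}\, foodprc + \beta_{33}\, clothprc + \beta_{34}\, income + \beta_{35}\, size + \beta_{36}\, age + u_3.
\end{aligned}$$


In this example, G = 3 and $x_g$ (a 1 × 7 vector) is the same for g = 1, 2, 3.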
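
As a rough numerical illustration of a system like (7.1) with common regressors,
the sketch below simulates three equations whose errors are correlated across
equations and estimates each equation by OLS. It is not the demand system above:
the regressors, coefficient matrix `B_true`, and error covariance `Sigma` are
hypothetical values chosen for the example.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 3000

# Common regressors for all G = 3 equations: an intercept and two hypothetical covariates.
x = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])

# Assumed coefficients: column g holds beta_g for equation g.
B_true = np.array([[1.0, 0.5, -0.2],
                   [0.0, 1.0, 0.3],
                   [2.0, -0.4, 0.8]]).T

# Errors are correlated across equations for the same cross section unit.
Sigma = np.array([[1.0, 0.6, 0.3],
                  [0.6, 1.0, 0.2],
                  [0.3, 0.2, 1.0]])
U = rng.multivariate_normal(np.zeros(3), Sigma, size=n)
Y = x @ B_true + U

# OLS equation by equation: with a 2-D left-hand side, lstsq solves each column separately.
B_hat, *_ = np.linalg.lstsq(x, Y, rcond=None)

# Cross-equation correlation of the OLS residuals.
resid = Y - x @ B_hat
resid_corr = np.corrcoef(resid, rowvar=False)

print(B_hat)
print(resid_corr)
```

Each equation has its own coefficient vector, yet the residual correlation matrix
is far from diagonal, which is the sense in which the equations are only
seemingly unrelated.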

When we need to write the equations for a particular random draw from the
population, $y_g$, $x_g$, and $u_g$ will also contain an i subscript: equation g becomes
$y_{ig} = x_{ig}\beta_g + u_{ig}$. For the purposes of stating assumptions, it does not matter whether or
not we include the i subscript. The system (7.1) has the advantage of being less
cluttered while focusing attention on the population, as is appropriate for applications.
But for derivations we will often need to indicate the equation for a generic cross
section unit i.

When we study the asymptotic properties of various estimators of the $\beta_g$, the
asymptotics is done with G fixed and N tending to infinity. In the household demand
example, we are interested in a set of three demand functions, and the unit of
observation is the family. Therefore, inference is done as the number of families in the
sample tends to infinity.

The assumptions that we make about how the unobservables $u_g$ are related to the explanatory variables $(\mathbf{x}_1, \mathbf{x}_2, \ldots, \mathbf{x}_G)$ are crucial for determining which estimators of the $\boldsymbol{\beta}_g$ have acceptable properties. Often, when system (7.1) represents a structural model (without omitted variables, errors-in-variables, or simultaneity), we can assume that

$$
E(u_g \mid \mathbf{x}_1, \mathbf{x}_2, \ldots, \mathbf{x}_G) = 0, \qquad g = 1, \ldots, G. \tag{7.2}
$$

One important implication of assumption (7.2) is that $u_g$ is uncorrelated with the explanatory variables in all equations, as well as all functions of these explanatory variables. When system (7.1) is a system of equations derived from economic theory, assumption (7.2) is often very natural. For example, in the set of demand functions that we have presented, $\mathbf{x}_g \equiv \mathbf{x}$ is the same for all $g$, and so assumption (7.2) is the same as $E(u_g \mid \mathbf{x}_g) = E(u_g \mid \mathbf{x}) = 0$.

If assumption (7.2) is maintained, and if the $\mathbf{x}_g$ are not the same across $g$, then any explanatory variables excluded from equation $g$ are assumed to have no effect on expected $y_g$ once $\mathbf{x}_g$ has been controlled for. That is,

$$
E(y_g \mid \mathbf{x}_1, \mathbf{x}_2, \ldots, \mathbf{x}_G) = E(y_g \mid \mathbf{x}_g) = \mathbf{x}_g\boldsymbol{\beta}_g, \qquad g = 1, 2, \ldots, G. \tag{7.3}
$$

There are examples of SUR systems where assumption (7.3) is too strong, but standard SUR analysis either explicitly or implicitly makes this assumption.


Our next example involves panel data. 


Example 7.2 (Panel Data Model): Suppose that for each cross section unit we observe data on the same set of variables for $T$ time periods. Let $\mathbf{x}_t$ be a $1 \times K$ vector for $t = 1, 2, \ldots, T$, and let $\boldsymbol{\beta}$ be a $K \times 1$ vector. The model in the population is

$$
y_t = \mathbf{x}_t\boldsymbol{\beta} + u_t, \qquad t = 1, 2, \ldots, T, \tag{7.4}
$$

where $y_t$ is a scalar. For example, a simple equation to explain annual family saving over a five-year span is

$$
\mathit{sav}_t = \beta_0 + \beta_1 \mathit{inc}_t + \beta_2 \mathit{age}_t + \beta_3 \mathit{educ}_t + u_t, \qquad t = 1, 2, \ldots, 5,
$$

where $\mathit{inc}_t$ is annual income, $\mathit{educ}_t$ is years of education of the household head, and $\mathit{age}_t$ is age of the household head. This is an example of a linear panel data model. It is a static model because all explanatory variables are dated contemporaneously with $\mathit{sav}_t$.

The panel data setup is conceptually very different from the SUR example. In Ex- 
ample 7.1, each equation explains a different dependent variable for the same cross 
section unit. Here we only have one dependent variable we are trying to explain, $\mathit{sav}$, but we observe $\mathit{sav}_t$, and the explanatory variables, over a five-year period.
(Therefore, the label “system of equations” is really a misnomer for panel data 
applications. At this point, we are using the phrase to denote more than one equation 
in any context.) As we will see in the next section, the statistical properties of esti- 
mators in SUR and panel data models can be analyzed within the same structure. 

When we need to indicate that an equation is for a particular cross section unit $i$ during a particular time period $t$, we write $y_{it} = \mathbf{x}_{it}\boldsymbol{\beta} + u_{it}$. We will omit the $i$ subscript whenever its omission does not cause confusion.

What kinds of exogeneity assumptions do we use for panel data analysis? One possibility is to assume that $u_t$ and $\mathbf{x}_t$ are orthogonal in the conditional mean sense:

$$
E(u_t \mid \mathbf{x}_t) = 0, \qquad t = 1, \ldots, T. \tag{7.5}
$$

We call this contemporaneous exogeneity of $\mathbf{x}_t$ because it only restricts the relationship between the disturbance and explanatory variables in the same time period. Naturally, assumption (7.5) implies that each element of $\mathbf{x}_t$ is uncorrelated with $u_t$. A stronger assumption on the explanatory variables is

$$
E(u_t \mid \mathbf{x}_t, \mathbf{x}_{t-1}, \ldots, \mathbf{x}_1) = 0, \qquad t = 1, \ldots, T \tag{7.6}
$$


(which, of course, implies that $\mathbf{x}_s$ is uncorrelated with $u_t$ for all $s \le t$). When assumption (7.6) holds we say that $\{\mathbf{x}_t\}$ is sequentially exogenous. When we combine sequential exogeneity and (7.4), we obtain

$$
E(y_t \mid \mathbf{x}_t, \mathbf{x}_{t-1}, \ldots, \mathbf{x}_1) = E(y_t \mid \mathbf{x}_t) = \mathbf{x}_t\boldsymbol{\beta}, \tag{7.7}
$$

which implies that whatever we have included in $\mathbf{x}_t$, no further lags of $\mathbf{x}_t$ are needed to explain the expected value of $y_t$. Once lagged variables are included in $\mathbf{x}_t$, it is often desirable (although not necessary) to have assumption (7.7) hold. For example, if $y_t$ is a measure of average worker productivity at the firm level, and $\mathbf{x}_t$ includes a current measure of worker training along with, say, two lags, then we probably hope that two lags are sufficient to capture the distributed lag of productivity on training. However, if we have only a small number of time periods available, we might have to settle for two lags capturing most, but not necessarily all, of the lagged effect.

An even stronger form of exogeneity is that $u_t$ has zero mean conditional on all explanatory variables in all time periods:

$$
E(u_t \mid \mathbf{x}_1, \mathbf{x}_2, \ldots, \mathbf{x}_T) = 0, \qquad t = 1, \ldots, T. \tag{7.8}
$$

Under assumption (7.8), we say the $\{\mathbf{x}_t\}$ are strictly exogenous. Clearly, condition (7.8) implies that $u_t$ is uncorrelated with all explanatory variables in all time periods, including future time periods. We can write assumption (7.8) under equation (7.4) as $E(y_t \mid \mathbf{x}_1, \mathbf{x}_2, \ldots, \mathbf{x}_T) = E(y_t \mid \mathbf{x}_t) = \mathbf{x}_t\boldsymbol{\beta}$. This condition is necessarily false if, say, $\mathbf{x}_{t+1}$ includes $y_t$.

Contemporaneous exogeneity says nothing about the relationship between $\mathbf{x}_s$ and $u_t$ for any $s \ne t$, sequential exogeneity leaves the relationship unrestricted for $s > t$, but strict exogeneity rules out correlation between the errors and explanatory variables across all time periods. It is critically important to understand that these three assumptions can have different implications for the statistical properties of different estimators.

To illustrate the differences among assumptions (7.5), (7.6), and (7.8), let $\mathbf{x}_t \equiv (1, y_{t-1})$. Then assumption (7.5) holds by construction if $E(y_t \mid y_{t-1}) = \beta_0 + \beta_1 y_{t-1}$, which just means that $E(y_t \mid y_{t-1})$ is linear in $y_{t-1}$. The sequential exogeneity assumption holds if we further assume $E(y_t \mid y_{t-1}, y_{t-2}, \ldots, y_0) = E(y_t \mid y_{t-1})$, which means that only one lag of the dependent variable appears in the fully dynamic expectation $E(y_t \mid y_{t-1}, y_{t-2}, \ldots, y_0)$ (in addition to $E(y_t \mid y_{t-1})$ being linear in $y_{t-1}$). Often this is an intended assumption for a dynamic model, but it is an extra assumption compared with assumption (7.5). In this example, the strict exogeneity condition (7.8) must fail because $\mathbf{x}_{t+1} = (1, y_t)$, and therefore $E(u_t \mid \mathbf{x}_1, \mathbf{x}_2, \ldots, \mathbf{x}_T) = E(u_t \mid y_0, \ldots, y_{T-1}) = u_t \ne 0$ for $t = 1, 2, \ldots, T - 1$ (because $u_t = y_t - \beta_0 - \beta_1 y_{t-1}$). Incidentally, it is not nearly enough to assume that the unconditional expected value of the error is zero, something that is almost always true in regression models.
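A small simulation can make the failure of strict exogeneity in the dynamic example concrete. The sketch below is only an illustration under assumed values: the AR(1) parameters, the standard normal errors, the sample sizes, and the function name `simulate_ar1` are all hypothetical choices, not part of the text. It compares the sample correlation of $u_t$ with $y_{t-1}$ (an element of $\mathbf{x}_t$) and with $y_t$ (an element of $\mathbf{x}_{t+1}$).

```python
import numpy as np

def simulate_ar1(n_units=20000, n_periods=4, beta0=1.0, beta1=0.5, seed=0):
    """Simulate y_t = beta0 + beta1 * y_{t-1} + u_t with i.i.d. N(0, 1) errors.

    Returns (y, u). Column 0 of y holds y_0; u[:, t] is the error in period t
    (column 0 of u is unused and left at zero).
    """
    rng = np.random.default_rng(seed)
    y = np.empty((n_units, n_periods + 1))
    u = np.zeros((n_units, n_periods + 1))
    y[:, 0] = rng.normal(size=n_units)
    for t in range(1, n_periods + 1):
        u[:, t] = rng.normal(size=n_units)
        y[:, t] = beta0 + beta1 * y[:, t - 1] + u[:, t]
    return y, u

y, u = simulate_ar1()
t = 2
# u_t against the regressor dated t (sequential exogeneity: should be near zero)
corr_past = np.corrcoef(u[:, t], y[:, t - 1])[0, 1]
# u_t against the regressor dated t + 1 (strict exogeneity would need zero here)
corr_future = np.corrcoef(u[:, t], y[:, t])[0, 1]
print(f"corr(u_t, y_(t-1)) = {corr_past:.3f}")
print(f"corr(u_t, y_t)     = {corr_future:.3f}")
```

In this design the first correlation should be close to zero while the second is clearly positive, because $y_t$ contains $u_t$ by construction.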

Strict exogeneity can fail even if $\mathbf{x}_t$ does not contain lagged dependent variables. Consider a model relating poverty rates to welfare spending per capita, at the city level. A finite distributed lag (FDL) model is

$$
\mathit{poverty}_t = \theta_t + \delta_0 \mathit{welfare}_t + \delta_1 \mathit{welfare}_{t-1} + \delta_2 \mathit{welfare}_{t-2} + u_t. \tag{7.9}
$$

The parameter $\theta_t$ simply denotes different aggregate time effects for each year. The contemporaneous exogeneity assumption holds if we assume $E(\mathit{poverty}_t \mid \mathit{welfare}_t, \mathit{welfare}_{t-1}, \mathit{welfare}_{t-2})$ is linear. In this example, we probably intend that sequential exogeneity holds, too, but this is an empirical question: Is a two-year lag sufficient to capture lagged effects of welfare spending on poverty rates?

Even if two lags of spending suffice to capture the distributed lag dynamics, strict exogeneity generally fails if welfare spending reacts to past poverty rates. An equation that captures this feedback is

$$
\mathit{welfare}_t = \eta_t + \rho_1 \mathit{poverty}_{t-1} + r_t. \tag{7.10}
$$

Generally, the strict exogeneity assumption (7.8) will be violated if $\rho_1 \ne 0$ because $\mathit{welfare}_{t+1}$ depends on $u_t$ (after substituting (7.9) into (7.10)) and $\mathbf{x}_{t+1}$ contains $\mathit{welfare}_{t+1}$.


As we will see in this and the next several chapters, how we go about estimating $\boldsymbol{\beta}$ depends crucially on whether strict exogeneity holds, or only one of the weaker assumptions. Classical treatments of ordinary least squares and generalized least squares with panel data tend to treat the $\mathbf{x}_{it}$ as fixed in repeated samples; in practice, this is the same as the strict exogeneity assumption.


7.3 System Ordinary Least Squares Estimation of a Multivariate Linear System 


7.3.1 Preliminaries 


We now analyze a general multivariate model that contains the examples in Section 7.2, and many others, as special cases. Assume that we have independent, identically distributed cross section observations $\{(\mathbf{X}_i, \mathbf{y}_i): i = 1, 2, \ldots, N\}$, where $\mathbf{X}_i$ is a $G \times K$ matrix and $\mathbf{y}_i$ is a $G \times 1$ vector. Thus, $\mathbf{y}_i$ contains the dependent variables for all $G$ equations (or time periods, in the panel data case). The matrix $\mathbf{X}_i$ contains the explanatory variables appearing anywhere in the system. For notational clarity we include the $i$ subscript for stating the general model and the assumptions.

The multivariate linear model for a random draw from the population can be expressed as

$$
\mathbf{y}_i = \mathbf{X}_i\boldsymbol{\beta} + \mathbf{u}_i, \tag{7.11}
$$

where $\boldsymbol{\beta}$ is the $K \times 1$ parameter vector of interest and $\mathbf{u}_i$ is a $G \times 1$ vector of unobservables. Equation (7.11) explains the $G$ variables $y_{i1}, \ldots, y_{iG}$ in terms of $\mathbf{X}_i$ and the unobservables $\mathbf{u}_i$. Because of the random sampling assumption, we can state all assumptions in terms of a generic observation; in examples, we will often omit the $i$ subscript.

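Before stating any assumptions, we show how the two examples introduced in Section 7.2 fit into this framework.

Example 7.1 (SUR, continued): The SUR model (7.1) can be expressed as in equation (7.11) by defining $\mathbf{y}_i = (y_{i1}, y_{i2}, \ldots, y_{iG})'$, $\mathbf{u}_i = (u_{i1}, u_{i2}, \ldots, u_{iG})'$, and

$$
\mathbf{X}_i = \begin{pmatrix}
\mathbf{x}_{i1} & \mathbf{0} & \cdots & \mathbf{0} \\
\mathbf{0} & \mathbf{x}_{i2} & & \vdots \\
\vdots & & \ddots & \mathbf{0} \\
\mathbf{0} & \cdots & \mathbf{0} & \mathbf{x}_{iG}
\end{pmatrix}, \qquad
\boldsymbol{\beta} = \begin{pmatrix} \boldsymbol{\beta}_1 \\ \boldsymbol{\beta}_2 \\ \vdots \\ \boldsymbol{\beta}_G \end{pmatrix}. \tag{7.12}
$$

Note that the dimension of $\mathbf{X}_i$ is $G \times (K_1 + K_2 + \cdots + K_G)$, so we define $K \equiv K_1 + \cdots + K_G$.

Example 7.2 (Panel Data, continued): The panel data model (7.4) can be expressed as in equation (7.11) by choosing $\mathbf{X}_i$ to be the $T \times K$ matrix $\mathbf{X}_i = (\mathbf{x}_{i1}', \mathbf{x}_{i2}', \ldots, \mathbf{x}_{iT}')'$.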
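For readers who want to see the two layouts of $\mathbf{X}_i$ side by side, the following sketch builds both for a single unit. The helper names `sur_design` and `panel_design` and the numerical values are hypothetical and serve only to illustrate the shapes described in the two examples.

```python
import numpy as np

def sur_design(x_rows):
    """Build the G x (K_1 + ... + K_G) block-diagonal X_i of (7.12).

    x_rows is a list of 1-D arrays x_i1, ..., x_iG; their lengths may differ.
    """
    G = len(x_rows)
    K = sum(len(x) for x in x_rows)
    X = np.zeros((G, K))
    col = 0
    for g, x in enumerate(x_rows):
        X[g, col:col + len(x)] = x   # row g holds x_ig in its own column block
        col += len(x)
    return X

def panel_design(x_rows):
    """Stack x_i1, ..., x_iT (each of length K) into the T x K matrix X_i."""
    return np.vstack(x_rows)

# SUR with G = 2 equations, K_1 = 2 and K_2 = 3 regressors
X_sur = sur_design([np.array([1.0, 2.0]), np.array([1.0, 3.0, 4.0])])
# Panel with T = 3 periods and K = 2 regressors (a constant and one variable)
X_panel = panel_design([np.array([1.0, 2.0]),
                        np.array([1.0, 5.0]),
                        np.array([1.0, 7.0])])
print(X_sur)
print(X_panel)
```

In the SUR layout each equation has its own block of columns, so each equation gets its own coefficients; in the panel layout all rows share the same $K$ columns, so one $\boldsymbol{\beta}$ applies in every period.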


7.3.2 Asymptotic Properties of System Ordinary Least Squares 


Given the model in equation (7.11), we can state the key orthogonality condition for consistent estimation of $\boldsymbol{\beta}$ by system ordinary least squares (SOLS).

ASSUMPTION SOLS.1: $E(\mathbf{X}_i'\mathbf{u}_i) = \mathbf{0}$.

Assumption SOLS.1 appears similar to the orthogonality condition for OLS analysis of single equations. What it implies differs across examples because of the multiple-equation nature of equation (7.11). For most applications, $\mathbf{X}_i$ has a sufficient number of elements equal to unity, so that Assumption SOLS.1 implies that $E(\mathbf{u}_i) = \mathbf{0}$, and we assume zero mean for the sake of discussion.

It is informative to see what Assumption SOLS.1 entails in the previous examples.

Example 7.1 (SUR, continued): In the SUR case, $\mathbf{X}_i'\mathbf{u}_i = (\mathbf{x}_{i1}u_{i1}, \ldots, \mathbf{x}_{iG}u_{iG})'$, and so Assumption SOLS.1 holds if and only if

$$
E(\mathbf{x}_{ig}'u_{ig}) = \mathbf{0}, \qquad g = 1, 2, \ldots, G. \tag{7.13}
$$

Thus, Assumption SOLS.1 does not require $\mathbf{x}_{ih}$ and $u_{ig}$ to be uncorrelated when $h \ne g$.

Example 7.2 (Panel Data, continued): For the panel data setup, $\mathbf{X}_i'\mathbf{u}_i = \sum_{t=1}^{T} \mathbf{x}_{it}'u_{it}$; therefore, a sufficient, and very natural, condition for Assumption SOLS.1 is

$$
E(\mathbf{x}_{it}'u_{it}) = \mathbf{0}, \qquad t = 1, 2, \ldots, T. \tag{7.14}
$$

Like assumption (7.5), assumption (7.14) allows $\mathbf{x}_{is}$ and $u_{it}$ to be correlated when $s \ne t$; in fact, assumption (7.14) is weaker than assumption (7.5). Therefore, Assumption SOLS.1 does not impose strict exogeneity in panel data contexts.


Assumption SOLS.1 is the weakest assumption we can impose in a regression framework to get consistent estimators of $\boldsymbol{\beta}$. As the previous examples show, Assumption SOLS.1 can hold when some elements of $\mathbf{X}_i$ are correlated with some elements of $\mathbf{u}_i$. Much stronger is the zero conditional mean assumption

$$
E(\mathbf{u}_i \mid \mathbf{X}_i) = \mathbf{0}, \tag{7.15}
$$

or $E(u_{ig} \mid \mathbf{X}_i) = 0$, $g = 1, 2, \ldots, G$. Assumption (7.15) implies, at a minimum, that each element of $\mathbf{X}_i$ is uncorrelated with each element of $\mathbf{u}_i$. For example, in the SUR model, (7.15) implies that $\mathbf{x}_{ih}$ is uncorrelated with $u_{ig}$ for $g = h$ and $g \ne h$. In the panel data example, (7.15) is the strict exogeneity assumption (7.8). As we will see later, for large-sample analysis, assumption (7.15) can be relaxed to a zero correlation assumption: all elements of $\mathbf{X}_i$ are uncorrelated with all elements of $\mathbf{u}_i$. The stronger assumption (7.15) is closely linked to traditional treatments of systems of equations under the assumption of nonrandom regressors.

Under Assumption SOLS.1 the vector $\boldsymbol{\beta}$ satisfies

$$
E[\mathbf{X}_i'(\mathbf{y}_i - \mathbf{X}_i\boldsymbol{\beta})] = \mathbf{0}, \tag{7.16}
$$

or $E(\mathbf{X}_i'\mathbf{X}_i)\boldsymbol{\beta} = E(\mathbf{X}_i'\mathbf{y}_i)$. For each $i$, $\mathbf{X}_i'\mathbf{y}_i$ is a $K \times 1$ random vector and $\mathbf{X}_i'\mathbf{X}_i$ is a $K \times K$ symmetric, positive semidefinite random matrix. Therefore, $E(\mathbf{X}_i'\mathbf{X}_i)$ is always a $K \times K$ symmetric, positive semidefinite nonrandom matrix (the expectation here is defined over the population distribution of $\mathbf{X}_i$). To be able to estimate $\boldsymbol{\beta}$, we need to assume that it is the only $K \times 1$ vector that satisfies equation (7.16).


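ASSUMPTION SOLS.2: $\mathbf{A} \equiv E(\mathbf{X}_i'\mathbf{X}_i)$ is nonsingular (has rank $K$).

Under Assumptions SOLS.1 and SOLS.2, we can write $\boldsymbol{\beta}$ as

$$
\boldsymbol{\beta} = [E(\mathbf{X}_i'\mathbf{X}_i)]^{-1} E(\mathbf{X}_i'\mathbf{y}_i), \tag{7.17}
$$

which shows that Assumptions SOLS.1 and SOLS.2 identify the vector $\boldsymbol{\beta}$. The analogy principle suggests that we estimate $\boldsymbol{\beta}$ by the sample analogue of equation (7.17). Define the system ordinary least squares (SOLS) estimator of $\boldsymbol{\beta}$ as

$$
\hat{\boldsymbol{\beta}} = \left(N^{-1}\sum_{i=1}^{N} \mathbf{X}_i'\mathbf{X}_i\right)^{-1}\left(N^{-1}\sum_{i=1}^{N} \mathbf{X}_i'\mathbf{y}_i\right). \tag{7.18}
$$

For computing $\hat{\boldsymbol{\beta}}$ using matrix language programming, it is sometimes useful to write $\hat{\boldsymbol{\beta}} = (\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\mathbf{Y}$, where $\mathbf{X} \equiv (\mathbf{X}_1', \mathbf{X}_2', \ldots, \mathbf{X}_N')'$ is the $NG \times K$ matrix of stacked $\mathbf{X}_i$ and $\mathbf{Y} \equiv (\mathbf{y}_1', \mathbf{y}_2', \ldots, \mathbf{y}_N')'$ is the $NG \times 1$ vector of stacked observations on the $\mathbf{y}_i$. For asymptotic derivations, equation (7.18) is much more convenient. In fact, the consistency of $\hat{\boldsymbol{\beta}}$ can be read off equation (7.18) by taking probability limits. We summarize with a theorem:

THEOREM 7.1 (Consistency of System OLS): Under Assumptions SOLS.1 and SOLS.2, $\hat{\boldsymbol{\beta}} \overset{p}{\to} \boldsymbol{\beta}$.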
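As a computational aside, the two ways of writing the SOLS estimator, the sums in (7.18) and the stacked form $(\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\mathbf{Y}$, can be checked against each other numerically. The sketch below uses simulated data; the array layout (an `N x G x K` array for the $\mathbf{X}_i$), the function name `system_ols`, and the parameter values are assumptions made for illustration.

```python
import numpy as np

def system_ols(X, y):
    """System OLS as in (7.18).

    X has shape (N, G, K) and y has shape (N, G); returns the K-vector beta_hat.
    """
    N = X.shape[0]
    XtX = np.einsum("igk,igl->kl", X, X) / N   # N^{-1} sum_i X_i'X_i
    Xty = np.einsum("igk,ig->k", X, y) / N     # N^{-1} sum_i X_i'y_i
    return np.linalg.solve(XtX, Xty)

# Hypothetical simulated system with G = 3 and K = 2
rng = np.random.default_rng(0)
N, G, K = 500, 3, 2
X = rng.normal(size=(N, G, K))
beta = np.array([1.0, -2.0])
u = rng.normal(size=(N, G))
y = X @ beta + u                               # y_i = X_i beta + u_i for every i

beta_hat = system_ols(X, y)

# Stacked form: X is NG x K and Y is NG x 1
X_stacked = X.reshape(N * G, K)
Y_stacked = y.reshape(N * G)
beta_stacked = np.linalg.lstsq(X_stacked, Y_stacked, rcond=None)[0]

print(beta_hat, beta_stacked)
```

The two computations agree up to floating-point error, and with this sample size both should lie near the assumed $\boldsymbol{\beta}$.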


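It is useful to see what the system OLS estimator looks like for the SUR and panel data examples.

Example 7.1 (SUR, continued): For the SUR model,

$$
\sum_{i=1}^{N} \mathbf{X}_i'\mathbf{X}_i = \sum_{i=1}^{N}
\begin{pmatrix}
\mathbf{x}_{i1}'\mathbf{x}_{i1} & \mathbf{0} & \cdots & \mathbf{0} \\
\mathbf{0} & \mathbf{x}_{i2}'\mathbf{x}_{i2} & & \vdots \\
\vdots & & \ddots & \mathbf{0} \\
\mathbf{0} & \cdots & \mathbf{0} & \mathbf{x}_{iG}'\mathbf{x}_{iG}
\end{pmatrix}, \qquad
\sum_{i=1}^{N} \mathbf{X}_i'\mathbf{y}_i = \sum_{i=1}^{N}
\begin{pmatrix}
\mathbf{x}_{i1}'y_{i1} \\ \mathbf{x}_{i2}'y_{i2} \\ \vdots \\ \mathbf{x}_{iG}'y_{iG}
\end{pmatrix}.
$$

Straightforward inversion of a block diagonal matrix shows that the OLS estimator from equation (7.18) can be written as $\hat{\boldsymbol{\beta}} = (\hat{\boldsymbol{\beta}}_1', \hat{\boldsymbol{\beta}}_2', \ldots, \hat{\boldsymbol{\beta}}_G')'$, where each $\hat{\boldsymbol{\beta}}_g$ is just the single-equation OLS estimator from the $g$th equation. In other words, system OLS estimation of an SUR model (without restrictions on the parameter vectors $\boldsymbol{\beta}_g$) is equivalent to OLS equation by equation. Assumption SOLS.2 is easily seen to hold if $E(\mathbf{x}_{ig}'\mathbf{x}_{ig})$ is nonsingular for all $g$.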
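The equivalence between system OLS and OLS equation by equation in the SUR case is easy to confirm numerically. The following sketch uses a hypothetical two-equation system with different regressors in each equation and errors that are correlated across equations; all data and parameter values are made up for the illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
N = 400

# Two equations with different regressors: K_1 = 2 and K_2 = 3 (hypothetical data)
x1 = np.column_stack([np.ones(N), rng.normal(size=N)])
x2 = np.column_stack([np.ones(N), rng.normal(size=(N, 2))])

# Errors correlated across equations, with different variances
u = rng.multivariate_normal([0.0, 0.0], [[1.0, 0.6], [0.6, 2.0]], size=N)
y1 = x1 @ np.array([1.0, 2.0]) + u[:, 0]
y2 = x2 @ np.array([-1.0, 0.5, 3.0]) + u[:, 1]

# Block-diagonal X_i (2 x 5) for every i, then system OLS via (7.18)
X = np.zeros((N, 2, 5))
X[:, 0, :2] = x1
X[:, 1, 2:] = x2
Y = np.column_stack([y1, y2])
XtX = np.einsum("igk,igl->kl", X, X)          # sum_i X_i'X_i (block diagonal)
XtY = np.einsum("igk,ig->k", X, Y)            # sum_i X_i'y_i
beta_sols = np.linalg.solve(XtX, XtY)

# Single-equation OLS, one equation at a time
b1 = np.linalg.lstsq(x1, y1, rcond=None)[0]
b2 = np.linalg.lstsq(x2, y2, rcond=None)[0]
beta_by_eq = np.concatenate([b1, b2])

print(np.max(np.abs(beta_sols - beta_by_eq)))
```

The cross-equation error correlation plays no role in the point estimates here, which is exactly why the system is "seemingly" unrelated under OLS.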


Example 7.2 (Panel Data, continued): In the panel data case,

$$
\sum_{i=1}^{N} \mathbf{X}_i'\mathbf{X}_i = \sum_{i=1}^{N}\sum_{t=1}^{T} \mathbf{x}_{it}'\mathbf{x}_{it}, \qquad
\sum_{i=1}^{N} \mathbf{X}_i'\mathbf{y}_i = \sum_{i=1}^{N}\sum_{t=1}^{T} \mathbf{x}_{it}'y_{it}.
$$

Therefore, we can write $\hat{\boldsymbol{\beta}}$ as

$$
\hat{\boldsymbol{\beta}} = \left(\sum_{i=1}^{N}\sum_{t=1}^{T} \mathbf{x}_{it}'\mathbf{x}_{it}\right)^{-1}\left(\sum_{i=1}^{N}\sum_{t=1}^{T} \mathbf{x}_{it}'y_{it}\right). \tag{7.19}
$$

This estimator is called the pooled ordinary least squares (POLS) estimator because it corresponds to running OLS on the observations pooled across $i$ and $t$. We mentioned this estimator in the context of independent cross sections in Section 6.5. The estimator in equation (7.19) is for the same cross section units sampled at different points in time. Theorem 7.1 shows that the POLS estimator is consistent under the orthogonality conditions in assumption (7.14) and the mild condition $\operatorname{rank} E\left(\sum_{t=1}^{T} \mathbf{x}_{it}'\mathbf{x}_{it}\right) = K$.


In the general system (7.11), the system OLS estimator does not necessarily have an interpretation as OLS equation by equation or as pooled OLS. As we will see in Section 7.7 for the SUR setup, sometimes we want to impose cross equation restrictions on the $\boldsymbol{\beta}_g$, in which case the system OLS estimator has no simple interpretation.

While OLS is consistent under Assumptions SOLS.1 and SOLS.2, it is not necessarily unbiased. Assumption (7.15), and the finite sample assumption $\operatorname{rank}(\mathbf{X}'\mathbf{X}) = K$, do ensure unbiasedness of OLS conditional on $\mathbf{X}$. (This conclusion follows because, under independent sampling, $E(\mathbf{u}_i \mid \mathbf{X}_1, \mathbf{X}_2, \ldots, \mathbf{X}_N) = E(\mathbf{u}_i \mid \mathbf{X}_i) = \mathbf{0}$ under assumption (7.15).) We focus on the weaker Assumption SOLS.1 because assumption (7.15) is often violated in economic applications, something we already saw for a dynamic panel data model.

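For inference, we need to find the asymptotic variance of the OLS estimator under essentially the same two assumptions; technically, the following derivation requires the elements of $\mathbf{X}_i'\mathbf{u}_i\mathbf{u}_i'\mathbf{X}_i$ to have finite expected absolute value. From (7.18) and (7.11), write

$$
\sqrt{N}(\hat{\boldsymbol{\beta}} - \boldsymbol{\beta}) = \left(N^{-1}\sum_{i=1}^{N} \mathbf{X}_i'\mathbf{X}_i\right)^{-1}\left(N^{-1/2}\sum_{i=1}^{N} \mathbf{X}_i'\mathbf{u}_i\right).
$$

Because $E(\mathbf{X}_i'\mathbf{u}_i) = \mathbf{0}$ under Assumption SOLS.1, the CLT implies that

$$
N^{-1/2}\sum_{i=1}^{N} \mathbf{X}_i'\mathbf{u}_i \overset{d}{\to} \text{Normal}(\mathbf{0}, \mathbf{B}), \tag{7.20}
$$

where

$$
\mathbf{B} \equiv E(\mathbf{X}_i'\mathbf{u}_i\mathbf{u}_i'\mathbf{X}_i) \equiv \operatorname{Var}(\mathbf{X}_i'\mathbf{u}_i). \tag{7.21}
$$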


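In particular, $N^{-1/2}\sum_{i=1}^{N} \mathbf{X}_i'\mathbf{u}_i = O_p(1)$. But $(\mathbf{X}'\mathbf{X}/N)^{-1} = \mathbf{A}^{-1} + o_p(1)$, so

$$
\begin{aligned}
\sqrt{N}(\hat{\boldsymbol{\beta}} - \boldsymbol{\beta})
&= \mathbf{A}^{-1}\left(N^{-1/2}\sum_{i=1}^{N} \mathbf{X}_i'\mathbf{u}_i\right) + \left[(\mathbf{X}'\mathbf{X}/N)^{-1} - \mathbf{A}^{-1}\right]\left(N^{-1/2}\sum_{i=1}^{N} \mathbf{X}_i'\mathbf{u}_i\right) \\
&= \mathbf{A}^{-1}\left(N^{-1/2}\sum_{i=1}^{N} \mathbf{X}_i'\mathbf{u}_i\right) + o_p(1)\cdot O_p(1) \\
&= \mathbf{A}^{-1}\left(N^{-1/2}\sum_{i=1}^{N} \mathbf{X}_i'\mathbf{u}_i\right) + o_p(1).
\end{aligned}
\tag{7.22}
$$

Therefore, just as with single-equation OLS and 2SLS, we have obtained an asymptotic representation for $\sqrt{N}(\hat{\boldsymbol{\beta}} - \boldsymbol{\beta})$ that is a nonrandom linear combination of a partial sum that satisfies the CLT. Equations (7.20) and (7.22) and the asymptotic equivalence lemma imply

$$
\sqrt{N}(\hat{\boldsymbol{\beta}} - \boldsymbol{\beta}) \overset{a}{\sim} \text{Normal}(\mathbf{0}, \mathbf{A}^{-1}\mathbf{B}\mathbf{A}^{-1}). \tag{7.23}
$$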


We summarize with a theorem. 


THEOREM 7.2 (Asymptotic Normality of SOLS): Under Assumptions SOLS.1 and 
SOLS.2, equation (7.23) holds. 


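The asymptotic variance of $\hat{\boldsymbol{\beta}}$ is

$$
\operatorname{Avar}(\hat{\boldsymbol{\beta}}) = \mathbf{A}^{-1}\mathbf{B}\mathbf{A}^{-1}/N, \tag{7.24}
$$

so that $\operatorname{Avar}(\hat{\boldsymbol{\beta}})$ shrinks to zero at the rate $1/N$, as expected. Consistent estimation of $\mathbf{A}$ is simple:

$$
\hat{\mathbf{A}} \equiv \mathbf{X}'\mathbf{X}/N = N^{-1}\sum_{i=1}^{N} \mathbf{X}_i'\mathbf{X}_i. \tag{7.25}
$$

A consistent estimator of $\mathbf{B}$ can be found using the analogy principle. First, because $\mathbf{B} = E(\mathbf{X}_i'\mathbf{u}_i\mathbf{u}_i'\mathbf{X}_i)$, $N^{-1}\sum_{i=1}^{N} \mathbf{X}_i'\mathbf{u}_i\mathbf{u}_i'\mathbf{X}_i \overset{p}{\to} \mathbf{B}$. Since the $\mathbf{u}_i$ are not observed, we replace them with the SOLS residuals:

$$
\hat{\mathbf{u}}_i \equiv \mathbf{y}_i - \mathbf{X}_i\hat{\boldsymbol{\beta}} = \mathbf{u}_i - \mathbf{X}_i(\hat{\boldsymbol{\beta}} - \boldsymbol{\beta}). \tag{7.26}
$$

Using matrix algebra and the law of large numbers, it can be shown that

$$
\hat{\mathbf{B}} \equiv N^{-1}\sum_{i=1}^{N} \mathbf{X}_i'\hat{\mathbf{u}}_i\hat{\mathbf{u}}_i'\mathbf{X}_i \overset{p}{\to} \mathbf{B}. \tag{7.27}
$$

(To establish equation (7.27), we need to assume that certain moments involving $\mathbf{X}_i$ and $\mathbf{u}_i$ are finite.) Therefore, $\operatorname{Avar}\sqrt{N}(\hat{\boldsymbol{\beta}} - \boldsymbol{\beta})$ is consistently estimated by $\hat{\mathbf{A}}^{-1}\hat{\mathbf{B}}\hat{\mathbf{A}}^{-1}$, and $\operatorname{Avar}(\hat{\boldsymbol{\beta}})$ is estimated as

$$
\hat{\mathbf{V}} \equiv \left(\sum_{i=1}^{N} \mathbf{X}_i'\mathbf{X}_i\right)^{-1}\left(\sum_{i=1}^{N} \mathbf{X}_i'\hat{\mathbf{u}}_i\hat{\mathbf{u}}_i'\mathbf{X}_i\right)\left(\sum_{i=1}^{N} \mathbf{X}_i'\mathbf{X}_i\right)^{-1}. \tag{7.28}
$$

Under Assumptions SOLS.1 and SOLS.2, we perform inference on $\boldsymbol{\beta}$ as if $\hat{\boldsymbol{\beta}}$ is normally distributed with mean $\boldsymbol{\beta}$ and variance matrix (7.28). The square roots of the diagonal elements of the matrix (7.28) are reported as the asymptotic standard errors. The $t$ ratio, $\hat{\beta}_j/\operatorname{se}(\hat{\beta}_j)$, has a limiting normal distribution under the null hypothesis $H_0: \beta_j = 0$. Sometimes the $t$ statistics are treated as being distributed as $t_{NG-K}$, which is asymptotically valid because $NG - K$ should be large.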
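To show how the pieces of (7.28) fit together in practice, here is a minimal sketch that computes the SOLS estimate, the sandwich variance matrix, and the resulting standard errors. The function name `sols_robust`, the simulated panel design, and the error process (a unit-specific component plus heteroskedastic noise) are assumptions chosen only to produce serial correlation and heteroskedasticity.

```python
import numpy as np

def sols_robust(X, y):
    """System OLS with the robust variance matrix estimator in (7.28).

    X has shape (N, G, K) and y has shape (N, G).
    Returns (beta_hat, V_hat, standard_errors).
    """
    XtX = np.einsum("igk,igl->kl", X, X)          # sum_i X_i'X_i
    Xty = np.einsum("igk,ig->k", X, y)            # sum_i X_i'y_i
    beta_hat = np.linalg.solve(XtX, Xty)
    resid = y - X @ beta_hat                      # row i is u_hat_i'
    scores = np.einsum("igk,ig->ik", X, resid)    # row i is (X_i'u_hat_i)'
    meat = scores.T @ scores                      # sum_i X_i'u_hat_i u_hat_i'X_i
    XtX_inv = np.linalg.inv(XtX)
    V_hat = XtX_inv @ meat @ XtX_inv
    return beta_hat, V_hat, np.sqrt(np.diag(V_hat))

# Hypothetical panel: N units, T periods, a constant and one regressor
rng = np.random.default_rng(2)
N, T = 1000, 4
X = np.concatenate([np.ones((N, T, 1)), rng.normal(size=(N, T, 1))], axis=2)
c = rng.normal(size=(N, 1))                       # induces serial correlation
u = c + rng.normal(size=(N, T)) * (1 + np.abs(X[:, :, 1]))  # heteroskedastic
beta = np.array([1.0, 0.5])
y = X @ beta + u

beta_hat, V_hat, se = sols_robust(X, y)
print(beta_hat, se)
```

Because the sums run over cross section units $i$ rather than over individual $(i, t)$ pairs, the middle matrix retains all within-unit cross products of the residuals, which is what allows arbitrary serial correlation.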

The estimator in matrix (7.28) is another example of a robust variance matrix estimator because it is valid without any second-moment assumptions on the errors $\mathbf{u}_i$ (except, as usual, that the second moments are well defined). In a multivariate setting it is important to know what this robustness allows. First, the $G \times G$ unconditional variance matrix, $\boldsymbol{\Omega} \equiv E(\mathbf{u}_i\mathbf{u}_i')$, is entirely unrestricted. This fact allows cross equation correlation in an SUR system as well as different error variances in each equation. In panel data models, an unrestricted $\boldsymbol{\Omega}$ allows for arbitrary serial correlation and time-varying variances in the disturbances. A second kind of robustness is that the conditional variance matrix, $\operatorname{Var}(\mathbf{u}_i \mid \mathbf{X}_i)$, can depend on $\mathbf{X}_i$ in an arbitrary, unknown fashion. The generality afforded by formula (7.28) is possible because of the $N \to \infty$ asymptotics.

In special cases it is useful to impose more structure on the conditional and unconditional variance matrix of $\mathbf{u}_i$ in order to simplify estimation of the asymptotic variance. We will cover an important case in Section 7.5.2. Essentially, the key restriction will be that the conditional and unconditional variances of $\mathbf{u}_i$ are the same.

There are also some special assumptions that greatly simplify the analysis of the 
pooled OLS estimator for panel data; see Section 7.8. 


7.3.3 Testing Multiple Hypotheses 


Testing multiple hypotheses in a fully robust manner is easy once $\hat{\mathbf{V}}$ in matrix (7.28) has been obtained. The robust Wald statistic for testing $H_0: \mathbf{R}\boldsymbol{\beta} = \mathbf{r}$, where $\mathbf{R}$ is $Q \times K$ with rank $Q$ and $\mathbf{r}$ is $Q \times 1$, has its usual form, $W = (\mathbf{R}\hat{\boldsymbol{\beta}} - \mathbf{r})'(\mathbf{R}\hat{\mathbf{V}}\mathbf{R}')^{-1}(\mathbf{R}\hat{\boldsymbol{\beta}} - \mathbf{r})$. Under $H_0$, $W \overset{a}{\sim} \chi^2_Q$. In the SUR case this is the easiest and most robust way of testing cross equation restrictions on the parameters in different equations using system OLS. In the panel data setting, the robust Wald test provides a way of testing multiple hypotheses about $\boldsymbol{\beta}$ without assuming homoskedasticity or serial independence of the errors.
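The Wald statistic is a one-line computation once $\hat{\boldsymbol{\beta}}$ and $\hat{\mathbf{V}}$ are available. The sketch below assumes both have already been obtained (for example, from the formulas in Section 7.3.2); the function name `robust_wald` and the numerical values of `beta_hat` and `V_hat` are hypothetical and are not estimates from any data set.

```python
import numpy as np

def robust_wald(beta_hat, V_hat, R, r):
    """Wald statistic W = (R b - r)' (R V R')^{-1} (R b - r) for H0: R beta = r."""
    R = np.atleast_2d(R)
    diff = R @ beta_hat - r
    return float(diff @ np.linalg.solve(R @ V_hat @ R.T, diff))

# Hypothetical estimate and robust variance matrix for K = 3 parameters
beta_hat = np.array([1.2, 0.4, -0.3])
V_hat = np.array([[0.04, 0.01, 0.00],
                  [0.01, 0.09, 0.02],
                  [0.00, 0.02, 0.16]])

# H0: beta_2 = 0 and beta_3 = 0, so Q = 2 restrictions
R = np.array([[0.0, 1.0, 0.0],
              [0.0, 0.0, 1.0]])
r = np.zeros(2)
W = robust_wald(beta_hat, V_hat, R, r)
print(W)   # compare with a chi-square critical value with Q = 2 degrees of freedom
```

With a single restriction ($Q = 1$) on one coefficient, the statistic reduces to the square of the robust $t$ ratio.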


7.4 Consistency and Asymptotic Normality of Generalized Least Squares 


7.4.1 Consistency 


System OLS is consistent under fairly weak assumptions, and we have seen how to perform robust inference using OLS. If we strengthen Assumption SOLS.1 and add assumptions on the conditional variance matrix of $\mathbf{u}_i$, we can do better using a generalized least squares procedure. As we will see, GLS is not usually feasible because it requires knowing the variance matrix of the errors up to a multiplicative constant. Nevertheless, deriving the consistency and asymptotic distribution of the GLS estimator is worthwhile because it turns out that the feasible GLS estimator is asymptotically equivalent to GLS.

We are still interested in estimating the parameter vector $\boldsymbol{\beta}$ in equation (7.11), but consistency of GLS generally requires a stronger assumption than Assumption SOLS.1. Although we can, for certain purposes, get by with a weaker assumption, the most straightforward analysis follows from assuming each element of $\mathbf{X}_i$ is uncorrelated with each element of $\mathbf{u}_i$. The strengthening of Assumption SOLS.1 is most easily stated using the Kronecker product:


ASSUMPTION SGLS.1: $E(\mathbf{X}_i \otimes \mathbf{u}_i) = \mathbf{0}$.

Typically, at least one element of $\mathbf{X}_i$ is unity, so in practice Assumption SGLS.1 implies that $E(\mathbf{u}_i) = \mathbf{0}$. We will assume that $\mathbf{u}_i$ has a zero mean for our discussion but not in proving any results. A sufficient condition for Assumption SGLS.1 is the zero conditional mean assumption $E(\mathbf{u}_i \mid \mathbf{X}_i) = \mathbf{0}$, which, of course, also implies $E(\mathbf{u}_i) = \mathbf{0}$.

The second moment matrix of $\mathbf{u}_i$, which is necessarily constant across $i$ by the random sampling assumption, plays a critical role for GLS estimation of systems of equations. Define the $G \times G$ positive semidefinite matrix $\boldsymbol{\Omega}$ as

$$
\boldsymbol{\Omega} \equiv E(\mathbf{u}_i\mathbf{u}_i'). \tag{7.29}
$$

Because $E(\mathbf{u}_i) = \mathbf{0}$ in the vast majority of applications, we will refer to $\boldsymbol{\Omega}$ as the unconditional variance matrix of $\mathbf{u}_i$. For our general treatment, we assume it is actually positive definite. In applications where the dependent variables satisfy an adding up constraint across equations (such as expenditure shares summing to unity), an equation must be dropped to ensure that $\boldsymbol{\Omega}$ is nonsingular, a topic we return to in Section 7.7.3.

Having defined $\boldsymbol{\Omega}$, and assuming it is nonsingular, we can state a weaker version of Assumption SGLS.1 that is nevertheless sufficient for consistency of the GLS (and feasible GLS) estimator:

$$
E(\mathbf{X}_i'\boldsymbol{\Omega}^{-1}\mathbf{u}_i) = \mathbf{0}, \tag{7.30}
$$

which simply says that the linear combination $\boldsymbol{\Omega}^{-1}\mathbf{X}_i$ of $\mathbf{X}_i$ is uncorrelated with $\mathbf{u}_i$. It follows that Assumption SGLS.1 implies (7.30); a concise proof using matrix algebra is

$$
\operatorname{vec} E(\mathbf{X}_i'\boldsymbol{\Omega}^{-1}\mathbf{u}_i) = E[\operatorname{vec}(\mathbf{X}_i'\boldsymbol{\Omega}^{-1}\mathbf{u}_i)] = E[(\mathbf{u}_i' \otimes \mathbf{X}_i')]\operatorname{vec}(\boldsymbol{\Omega}^{-1}) = \mathbf{0}.
$$

(Recall that for conformable matrices $\mathbf{D}$, $\mathbf{E}$, and $\mathbf{F}$, $\operatorname{vec}(\mathbf{DEF}) = (\mathbf{F}' \otimes \mathbf{D})\operatorname{vec}(\mathbf{E})$, where $\operatorname{vec}(\mathbf{C})$ is the vectorization of the matrix $\mathbf{C}$; see, for example, Theil (1983).)
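The vec identity used in the proof can be verified numerically for arbitrary conformable matrices. In the sketch below, the helper `vec` and the randomly drawn `X`, `u`, and `Omega` are hypothetical stand-ins for $\mathbf{X}_i$, $\mathbf{u}_i$, and $\boldsymbol{\Omega}$; the only point is that $\operatorname{vec}(\mathbf{X}_i'\boldsymbol{\Omega}^{-1}\mathbf{u}_i) = (\mathbf{u}_i' \otimes \mathbf{X}_i')\operatorname{vec}(\boldsymbol{\Omega}^{-1})$ holds draw by draw.

```python
import numpy as np

def vec(M):
    """Stack the columns of M into one long vector (column-major order)."""
    return np.asarray(M).reshape(-1, order="F")

rng = np.random.default_rng(3)
G, K = 3, 2
X = rng.normal(size=(G, K))          # plays the role of X_i (G x K)
u = rng.normal(size=(G, 1))          # plays the role of u_i (G x 1)
A = rng.normal(size=(G, G))
Omega = A @ A.T + G * np.eye(G)      # an arbitrary positive definite matrix
Omega_inv = np.linalg.inv(Omega)

lhs = vec(X.T @ Omega_inv @ u)                 # vec(X_i' Omega^{-1} u_i), length K
rhs = np.kron(u.T, X.T) @ vec(Omega_inv)       # (u_i' kron X_i') vec(Omega^{-1})
print(np.max(np.abs(lhs - rhs)))
```

Since every element of $\mathbf{u}_i' \otimes \mathbf{X}_i'$ is a product of one element of $\mathbf{u}_i$ and one element of $\mathbf{X}_i$, taking expectations under Assumption SGLS.1 gives zero regardless of $\boldsymbol{\Omega}$.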

If (7.30) is sufficient for consistency of GLS, why do we make Assumption SGLS.1? There are a couple of reasons. First, SGLS.1 is more straightforward to interpret, and is independent of the structure of $\boldsymbol{\Omega}$. Second, and this is more subtle, when we turn to feasible GLS estimation in Section 7.5, Assumption SGLS.1 is used to establish the $\sqrt{N}$-equivalence of GLS and feasible GLS. If we knew $\boldsymbol{\Omega}$, (7.30) would be attractive as an assumption, but we almost never know $\boldsymbol{\Omega}$ (even up to a multiplicative constant). Assumption (7.30) will be relevant in Section 7.5.3, where we discuss imposing diagonality on the variance matrix.

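In place of Assumption SOLS.2 we assume that a weighted expected outer product of $\mathbf{X}_i$ is nonsingular. Here we insert the assumption of a nonsingular variance matrix for completeness:

ASSUMPTION SGLS.2: $\boldsymbol{\Omega}$ is positive definite and $E(\mathbf{X}_i'\boldsymbol{\Omega}^{-1}\mathbf{X}_i)$ is nonsingular.

The usual motivation for the GLS estimator is to transform a system of equations where the error has a nonscalar variance-covariance matrix into a system where the error vector has a scalar variance-covariance matrix. We obtain this by multiplying equation (7.11) by $\boldsymbol{\Omega}^{-1/2}$:

$$
\boldsymbol{\Omega}^{-1/2}\mathbf{y}_i = (\boldsymbol{\Omega}^{-1/2}\mathbf{X}_i)\boldsymbol{\beta} + \boldsymbol{\Omega}^{-1/2}\mathbf{u}_i, \quad \text{or} \quad \mathbf{y}_i^* = \mathbf{X}_i^*\boldsymbol{\beta} + \mathbf{u}_i^*. \tag{7.31}
$$

Simple algebra shows that $E(\mathbf{u}_i^*\mathbf{u}_i^{*\prime}) = \mathbf{I}_G$.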
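
Now we estimate equation (7.31) by system OLS. (As yet, we have no real justification for this step, but we know SOLS is consistent under some assumptions.) Call this estimator $\boldsymbol{\beta}^*$. Then

$$
\boldsymbol{\beta}^* \equiv \left(\sum_{i=1}^{N} \mathbf{X}_i^{*\prime}\mathbf{X}_i^*\right)^{-1}\left(\sum_{i=1}^{N} \mathbf{X}_i^{*\prime}\mathbf{y}_i^*\right) = \left(\sum_{i=1}^{N} \mathbf{X}_i'\boldsymbol{\Omega}^{-1}\mathbf{X}_i\right)^{-1}\left(\sum_{i=1}^{N} \mathbf{X}_i'\boldsymbol{\Omega}^{-1}\mathbf{y}_i\right). \tag{7.32}
$$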




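This is the generalized least squares (GLS) estimator of $\boldsymbol{\beta}$. Under Assumption SGLS.2, $\boldsymbol{\beta}^*$ exists with probability approaching one as $N \to \infty$.

We can write $\boldsymbol{\beta}^*$ using full matrix notation as $\boldsymbol{\beta}^* = [\mathbf{X}'(\mathbf{I}_N \otimes \boldsymbol{\Omega}^{-1})\mathbf{X}]^{-1}[\mathbf{X}'(\mathbf{I}_N \otimes \boldsymbol{\Omega}^{-1})\mathbf{Y}]$, where $\mathbf{X}$ and $\mathbf{Y}$ are the data matrices defined in Section 7.3.2 and $\mathbf{I}_N$ is the $N \times N$ identity matrix. But for establishing the asymptotic properties of $\boldsymbol{\beta}^*$, it is most convenient to work with equation (7.32).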
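The two expressions in (7.32), SOLS on the transformed data and the direct weighted formula, can be compared numerically. The sketch below assumes $\boldsymbol{\Omega}$ is known, which is the infeasible case discussed in this section; the function name `system_gls`, the chosen `Omega`, and the simulated data are hypothetical. The symmetric inverse square root is computed from an eigendecomposition, which is one of several valid choices for $\boldsymbol{\Omega}^{-1/2}$.

```python
import numpy as np

def system_gls(X, y, Omega):
    """GLS as in (7.32), with X of shape (N, G, K), y of shape (N, G),
    and a known positive definite G x G matrix Omega."""
    Oinv = np.linalg.inv(Omega)
    XOX = np.einsum("igk,gh,ihl->kl", X, Oinv, X)   # sum_i X_i' Omega^{-1} X_i
    XOy = np.einsum("igk,gh,ih->k", X, Oinv, y)     # sum_i X_i' Omega^{-1} y_i
    return np.linalg.solve(XOX, XOy)

# Hypothetical system with G = 3, K = 2 and a known Omega
rng = np.random.default_rng(4)
N, G, K = 300, 3, 2
Omega = np.array([[1.0, 0.5, 0.2],
                  [0.5, 2.0, 0.3],
                  [0.2, 0.3, 1.5]])
X = rng.normal(size=(N, G, K))
u = rng.multivariate_normal(np.zeros(G), Omega, size=N)
y = X @ np.array([1.0, -1.0]) + u

beta_gls = system_gls(X, y, Omega)

# Same estimator via the transformation in (7.31): SOLS on starred data
w, Q = np.linalg.eigh(Omega)
Omega_inv_half = Q @ np.diag(w ** -0.5) @ Q.T       # symmetric Omega^{-1/2}
X_star = np.einsum("gh,ihk->igk", Omega_inv_half, X)
y_star = y @ Omega_inv_half                         # row i is (Omega^{-1/2} y_i)'
beta_transformed = np.linalg.solve(
    np.einsum("igk,igl->kl", X_star, X_star),
    np.einsum("igk,ig->k", X_star, y_star),
)
print(beta_gls, beta_transformed)
```

Working with the per-unit sums avoids forming the $NG \times NG$ matrix $\mathbf{I}_N \otimes \boldsymbol{\Omega}^{-1}$, which matters once $N$ is large.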

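We can establish consistency of $\boldsymbol{\beta}^*$ under Assumptions SGLS.1 and SGLS.2 by writing

$$
\boldsymbol{\beta}^* = \boldsymbol{\beta} + \left(N^{-1}\sum_{i=1}^{N} \mathbf{X}_i'\boldsymbol{\Omega}^{-1}\mathbf{X}_i\right)^{-1}\left(N^{-1}\sum_{i=1}^{N} \mathbf{X}_i'\boldsymbol{\Omega}^{-1}\mathbf{u}_i\right). \tag{7.33}
$$

By the weak law of large numbers (WLLN), $N^{-1}\sum_{i=1}^{N} \mathbf{X}_i'\boldsymbol{\Omega}^{-1}\mathbf{X}_i \overset{p}{\to} E(\mathbf{X}_i'\boldsymbol{\Omega}^{-1}\mathbf{X}_i)$. By Assumption SGLS.2 and Slutsky's theorem (Lemma 3.4), $\left(N^{-1}\sum_{i=1}^{N} \mathbf{X}_i'\boldsymbol{\Omega}^{-1}\mathbf{X}_i\right)^{-1} \overset{p}{\to} \mathbf{A}^{-1}$, where $\mathbf{A}$ is now defined as

$$
\mathbf{A} \equiv E(\mathbf{X}_i'\boldsymbol{\Omega}^{-1}\mathbf{X}_i). \tag{7.34}
$$

Now we must show that $\operatorname{plim} N^{-1}\sum_{i=1}^{N} \mathbf{X}_i'\boldsymbol{\Omega}^{-1}\mathbf{u}_i = \mathbf{0}$, which holds by the WLLN under Assumption SGLS.1. Thus, we have shown that the GLS estimator is consistent under SGLS.1 and SGLS.2.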

The proof of consistency that we have sketched fails if we only make Assumption SOLS.1: $E(\mathbf{X}_i'\mathbf{u}_i) = \mathbf{0}$ does not imply $E(\mathbf{X}_i'\boldsymbol{\Omega}^{-1}\mathbf{u}_i) = \mathbf{0}$, except when $\boldsymbol{\Omega}$ and $\mathbf{X}_i$ have special structures. If Assumption SOLS.1 holds but Assumption SGLS.1 fails, the transformation in equation (7.31) generally induces correlation between $\mathbf{X}_i^*$ and $\mathbf{u}_i^*$. This can be an important point, especially for certain panel data applications. If we are willing to make the zero conditional mean assumption (7.15), $\boldsymbol{\beta}^*$ can be shown to be unbiased conditional on $\mathbf{X}$.


7.4.2 Asymptotic Normality 


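We now sketch the asymptotic normality of the GLS estimator under Assumptions SGLS.1 and SGLS.2 and some weak moment conditions. The first step is familiar:

$$
\sqrt{N}(\boldsymbol{\beta}^* - \boldsymbol{\beta}) = \left(N^{-1}\sum_{i=1}^{N} \mathbf{X}_i'\boldsymbol{\Omega}^{-1}\mathbf{X}_i\right)^{-1}\left(N^{-1/2}\sum_{i=1}^{N} \mathbf{X}_i'\boldsymbol{\Omega}^{-1}\mathbf{u}_i\right). \tag{7.35}
$$

By the CLT, $N^{-1/2}\sum_{i=1}^{N} \mathbf{X}_i'\boldsymbol{\Omega}^{-1}\mathbf{u}_i \overset{d}{\to} \text{Normal}(\mathbf{0}, \mathbf{B})$, where

$$
\mathbf{B} \equiv E(\mathbf{X}_i'\boldsymbol{\Omega}^{-1}\mathbf{u}_i\mathbf{u}_i'\boldsymbol{\Omega}^{-1}\mathbf{X}_i). \tag{7.36}
$$

Further, since $N^{-1/2}\sum_{i=1}^{N} \mathbf{X}_i'\boldsymbol{\Omega}^{-1}\mathbf{u}_i = O_p(1)$ and $\left(N^{-1}\sum_{i=1}^{N} \mathbf{X}_i'\boldsymbol{\Omega}^{-1}\mathbf{X}_i\right)^{-1} - \mathbf{A}^{-1} = o_p(1)$, we can write $\sqrt{N}(\boldsymbol{\beta}^* - \boldsymbol{\beta}) = \mathbf{A}^{-1}\left(N^{-1/2}\sum_{i=1}^{N} \mathbf{X}_i'\boldsymbol{\Omega}^{-1}\mathbf{u}_i\right) + o_p(1)$. It follows from the asymptotic equivalence lemma that

$$
\sqrt{N}(\boldsymbol{\beta}^* - \boldsymbol{\beta}) \overset{a}{\sim} \text{Normal}(\mathbf{0}, \mathbf{A}^{-1}\mathbf{B}\mathbf{A}^{-1}). \tag{7.37}
$$

Thus,

$$
\operatorname{Avar}(\boldsymbol{\beta}^*) = \mathbf{A}^{-1}\mathbf{B}\mathbf{A}^{-1}/N. \tag{7.38}
$$

The asymptotic variance in equation (7.38) is not the asymptotic variance usually derived for GLS estimation of systems of equations. Typically the formula is reported as $\mathbf{A}^{-1}/N$. But equation (7.38) is the appropriate expression under the assumptions made so far. The simpler form, which results when $\mathbf{B} = \mathbf{A}$, is not generally valid under Assumptions SGLS.1 and SGLS.2, because we have assumed nothing about the variance matrix of $\mathbf{u}_i$ conditional on $\mathbf{X}_i$. In Section 7.5.2 we make an assumption that simplifies equation (7.38).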


7.5 Feasible Generalized Least Squares 


7.5.1 Asymptotic Properties 


Obtaining the GLS estimator $\boldsymbol{\beta}^*$ requires knowing $\boldsymbol{\Omega}$ up to scale. That is, we must be able to write $\boldsymbol{\Omega} = \sigma^2\mathbf{C}$, where $\mathbf{C}$ is a known $G \times G$ positive definite matrix and $\sigma^2$ is allowed to be an unknown constant. Sometimes $\mathbf{C}$ is known (one case is $\mathbf{C} = \mathbf{I}_G$), but much more often it is unknown. Therefore, we now turn to the analysis of feasible GLS (FGLS) estimation.

In FGLS estimation we replace the unknown matrix $\boldsymbol{\Omega}$ with a consistent estimator. Because the estimator of $\boldsymbol{\Omega}$ appears highly nonlinearly in the expression for the FGLS estimator, deriving finite sample properties of FGLS is generally difficult. (However, under essentially assumption (7.15) and some additional assumptions, including symmetry of the distribution of $\mathbf{u}_i$, Kakwani (1967) showed that the distribution of the FGLS is symmetric about $\boldsymbol{\beta}$, a property which means that the FGLS is unbiased if its expected value exists; see also Schmidt (1976, Section 2.5).) The asymptotic properties of the FGLS estimator are easily established as $N \to \infty$ because, as we will show, its first-order asymptotic properties are identical to those of the GLS estimator under Assumptions SGLS.1 and SGLS.2. It is for this purpose that we spent some time on GLS. After establishing the asymptotic equivalence, we can easily obtain the limiting distribution of the FGLS estimator. Of course, GLS is trivially a special case of FGLS where there is no first-stage estimation error.

We initially assume we have a consistent estimator, $\hat{\boldsymbol{\Omega}}$, of $\boldsymbol{\Omega}$:

$$
\operatorname*{plim}_{N \to \infty} \hat{\boldsymbol{\Omega}} = \boldsymbol{\Omega}. \tag{7.39}
$$

(Because the dimension of $\hat{\boldsymbol{\Omega}}$ does not depend on $N$, equation (7.39) makes sense when defined element by element.) When $\boldsymbol{\Omega}$ is allowed to be a general positive definite matrix, the following estimation approach can be used. First, obtain the system OLS estimator of $\boldsymbol{\beta}$, which we denote $\check{\boldsymbol{\beta}}$ in this section to avoid confusion. We already showed that $\check{\boldsymbol{\beta}}$ is consistent for $\boldsymbol{\beta}$ under Assumptions SOLS.1 and SOLS.2, and therefore under Assumptions SGLS.1 and SOLS.2. (In what follows, we assume that Assumptions SOLS.2 and SGLS.2 both hold.) By the WLLN, $\operatorname{plim}\left(N^{-1}\sum_{i=1}^{N} \mathbf{u}_i\mathbf{u}_i'\right) = \boldsymbol{\Omega}$, and so a natural estimator of $\boldsymbol{\Omega}$ is

$$
\hat{\boldsymbol{\Omega}} \equiv N^{-1}\sum_{i=1}^{N} \check{\mathbf{u}}_i\check{\mathbf{u}}_i', \tag{7.40}
$$

where $\check{\mathbf{u}}_i \equiv \mathbf{y}_i - \mathbf{X}_i\check{\boldsymbol{\beta}}$ are the SOLS residuals. We can show that this estimator is consistent for $\boldsymbol{\Omega}$ under Assumptions SGLS.1 and SOLS.2 and standard moment conditions. First, write


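$$
\check{\mathbf{u}}_i = \mathbf{u}_i - \mathbf{X}_i(\check{\boldsymbol{\beta}} - \boldsymbol{\beta}), \tag{7.41}
$$

so that

$$
\check{\mathbf{u}}_i\check{\mathbf{u}}_i' = \mathbf{u}_i\mathbf{u}_i' - \mathbf{u}_i(\check{\boldsymbol{\beta}} - \boldsymbol{\beta})'\mathbf{X}_i' - \mathbf{X}_i(\check{\boldsymbol{\beta}} - \boldsymbol{\beta})\mathbf{u}_i' + \mathbf{X}_i(\check{\boldsymbol{\beta}} - \boldsymbol{\beta})(\check{\boldsymbol{\beta}} - \boldsymbol{\beta})'\mathbf{X}_i'. \tag{7.42}
$$

Therefore, it suffices to show that the averages of the last three terms converge in probability to zero. Write the average of the vec of the first of these terms as $N^{-1}\sum_{i=1}^{N} (\mathbf{X}_i \otimes \mathbf{u}_i)\cdot(\check{\boldsymbol{\beta}} - \boldsymbol{\beta})$, which is $o_p(1)$ because $\operatorname{plim}(\check{\boldsymbol{\beta}} - \boldsymbol{\beta}) = \mathbf{0}$ and $N^{-1}\sum_{i=1}^{N} (\mathbf{X}_i \otimes \mathbf{u}_i) \overset{p}{\to} \mathbf{0}$. The third term is the transpose of the second. For the last term in equation (7.42), note that the average of its vec can be written as

$$
N^{-1}\sum_{i=1}^{N} (\mathbf{X}_i \otimes \mathbf{X}_i)\cdot\operatorname{vec}\{(\check{\boldsymbol{\beta}} - \boldsymbol{\beta})(\check{\boldsymbol{\beta}} - \boldsymbol{\beta})'\}. \tag{7.43}
$$

Now $\operatorname{vec}\{(\check{\boldsymbol{\beta}} - \boldsymbol{\beta})(\check{\boldsymbol{\beta}} - \boldsymbol{\beta})'\} = o_p(1)$. Further, assuming that each element of $\mathbf{X}_i$ has finite second moment, $N^{-1}\sum_{i=1}^{N} (\mathbf{X}_i \otimes \mathbf{X}_i) = O_p(1)$ by the WLLN. This step takes care of the last term, since $O_p(1)\cdot o_p(1) = o_p(1)$. We have shown that

$$
\hat{\boldsymbol{\Omega}} = N^{-1}\sum_{i=1}^{N} \mathbf{u}_i\mathbf{u}_i' + o_p(1), \tag{7.44}
$$

and so equation (7.39) follows immediately. (In fact, a more careful analysis shows that the $o_p(1)$ in equation (7.44) can be replaced by $o_p(N^{-1/2})$; see Problem 7.4.)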

Sometimes the elements of $\boldsymbol{\Omega}$ are restricted in some way. In such cases a different estimator of $\boldsymbol{\Omega}$ is often used that exploits these restrictions. As with $\hat{\boldsymbol{\Omega}}$ in equation (7.40), such estimators typically use the system OLS residuals in some fashion and lead to consistent estimators assuming the structure of $\boldsymbol{\Omega}$ is correctly specified. The advantage of equation (7.40) is that it is consistent for $\boldsymbol{\Omega}$ quite generally. However, if $N$ is not very large relative to $G$, equation (7.40) can have poor finite sample properties. In Section 7.5.3 we discuss the consequences of using an inconsistent estimator of $\boldsymbol{\Omega}$. For now, we assume (7.39).

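Given $\hat{\Omega}$, the feasible GLS (FGLS) estimator of $\beta$ is


$$\hat{\beta} = \left( \sum_{i=1}^{N} X_i' \hat{\Omega}^{-1} X_i \right)^{-1} \left( \sum_{i=1}^{N} X_i' \hat{\Omega}^{-1} y_i \right), \tag{7.45}$$


or, in full matrix notation, $\hat{\beta} = [X'(I_N \otimes \hat{\Omega}^{-1})X]^{-1}[X'(I_N \otimes \hat{\Omega}^{-1})Y]$.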
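
The two-step procedure in (7.40) and (7.45) can be sketched in a few lines of NumPy. This is a minimal illustration on simulated data, not code from the text: the function names, the (N, G, K) array layout, and all simulated parameter values are assumptions made for the example.

```python
import numpy as np

def system_ols(X, y):
    """System OLS. X has shape (N, G, K); y has shape (N, G)."""
    XtX = np.einsum("igk,igl->kl", X, X)      # sum_i X_i' X_i
    Xty = np.einsum("igk,ig->k", X, y)        # sum_i X_i' y_i
    return np.linalg.solve(XtX, Xty)

def fgls(X, y):
    """FGLS with the unrestricted variance estimator built from SOLS residuals."""
    N = X.shape[0]
    u_check = y - X @ system_ols(X, y)        # SOLS residuals, shape (N, G)
    omega_hat = u_check.T @ u_check / N       # equation (7.40)
    W = np.linalg.inv(omega_hat)
    XWX = np.einsum("igk,gh,ihl->kl", X, W, X)    # sum_i X_i' W X_i
    XWy = np.einsum("igk,gh,ih->k", X, W, y)      # sum_i X_i' W y_i
    return np.linalg.solve(XWX, XWy), omega_hat   # equation (7.45)

# Simulated two-equation SUR system (all numbers are illustrative).
rng = np.random.default_rng(0)
N, G, K = 2000, 2, 4
X = np.zeros((N, G, K))
X[:, 0, 0], X[:, 0, 1] = 1.0, rng.normal(size=N)   # equation 1: intercept, x1
X[:, 1, 2], X[:, 1, 3] = 1.0, rng.normal(size=N)   # equation 2: intercept, x2
beta = np.array([1.0, 2.0, -1.0, 0.5])
omega = np.array([[1.0, 0.6], [0.6, 2.0]])
u = rng.multivariate_normal(np.zeros(G), omega, size=N)
y = X @ beta + u

beta_hat, omega_hat = fgls(X, y)
print(beta_hat)
print(omega_hat)
```

Summing over observations with `einsum` avoids ever forming the $NG \times NG$ matrix $I_N \otimes \hat{\Omega}^{-1}$, which is why the summation form in (7.45) is usually the more practical one to code.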

We have already shown that the (infeasible) GLS estimator is consistent under 
Assumptions SGLS.1 and SGLS.2. Because $\hat{\Omega}$ converges to $\Omega$, it is not surprising
that FGLS is also consistent. Rather than show this result separately, we verify the 
stronger result that FGLS has the same limiting distribution as GLS. 

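The limiting distribution of FGLS is obtained by writing


$$\sqrt{N}(\hat{\beta} - \beta) = \left( N^{-1} \sum_{i=1}^{N} X_i' \hat{\Omega}^{-1} X_i \right)^{-1} \left( N^{-1/2} \sum_{i=1}^{N} X_i' \hat{\Omega}^{-1} u_i \right). \tag{7.46}$$

Now

$$N^{-1/2} \sum_{i=1}^{N} X_i' \hat{\Omega}^{-1} u_i - N^{-1/2} \sum_{i=1}^{N} X_i' \Omega^{-1} u_i = \left[ N^{-1/2} \sum_{i=1}^{N} (u_i \otimes X_i)' \right] \operatorname{vec}(\hat{\Omega}^{-1} - \Omega^{-1}).$$


Under Assumption SGLS.1, the CLT implies that $N^{-1/2} \sum_{i=1}^{N} (u_i \otimes X_i) = O_p(1)$.
Because $O_p(1) \cdot o_p(1) = o_p(1)$, it follows that


$$N^{-1/2} \sum_{i=1}^{N} X_i' \hat{\Omega}^{-1} u_i = N^{-1/2} \sum_{i=1}^{N} X_i' \Omega^{-1} u_i + o_p(1).$$

A similar argument shows that $N^{-1} \sum_{i=1}^{N} X_i' \hat{\Omega}^{-1} X_i = N^{-1} \sum_{i=1}^{N} X_i' \Omega^{-1} X_i + o_p(1)$.
Therefore, we have shown that

$$\sqrt{N}(\hat{\beta} - \beta) = \left( N^{-1} \sum_{i=1}^{N} X_i' \Omega^{-1} X_i \right)^{-1} \left( N^{-1/2} \sum_{i=1}^{N} X_i' \Omega^{-1} u_i \right) + o_p(1). \tag{7.47}$$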
The first term in equation (7.47) is just $\sqrt{N}(\beta^* - \beta)$, where $\beta^*$ is the GLS estimator.
We can write equation (7.47) as


$$\sqrt{N}(\hat{\beta} - \beta^*) = o_p(1), \tag{7.48}$$


which shows that $\hat{\beta}$ and $\beta^*$ are $\sqrt{N}$-equivalent. Recall from Chapter 3 that this
statement is much stronger than simply saying that $\beta^*$ and $\hat{\beta}$ are both consistent for
$\beta$. There are many estimators, such as system OLS, that are consistent for $\beta$ but are
not $\sqrt{N}$-equivalent to $\beta^*$.

The asymptotic equivalence of $\hat{\beta}$ and $\beta^*$ has practically important consequences. The
most important of these is that, for performing asymptotic inference about $\beta$ using
$\hat{\beta}$, we do not have to worry that $\hat{\Omega}$ is an estimator of $\Omega$. Of course, whether the
asymptotic approximation gives a reasonable approximation to the actual distribution
of $\hat{\beta}$ is difficult to tell. With large $N$, the approximation is usually pretty good.
But if $N$ is small relative to $G$, ignoring estimation of $\Omega$ in performing inference
about $\beta$ can be misleading.

We summarize the limiting distribution of FGLS with a theorem. 


THEOREM 7.3 (Asymptotic Normality of FGLS): Under Assumptions SGLS.1 and
SGLS.2,


$$\sqrt{N}(\hat{\beta} - \beta) \overset{a}{\sim} \operatorname{Normal}(0, A^{-1} B A^{-1}), \tag{7.49}$$

where $A$ is defined in equation (7.34) and $B$ is defined in equation (7.36).

In the FGLS context, a consistent estimator of $A$ is

$$\hat{A} \equiv N^{-1} \sum_{i=1}^{N} X_i' \hat{\Omega}^{-1} X_i. \tag{7.50}$$

A consistent estimator of $B$ is also readily available after FGLS estimation. Define
the FGLS residuals by

$$\hat{u}_i \equiv y_i - X_i \hat{\beta}, \quad i = 1, 2, \ldots, N. \tag{7.51}$$


(The only difference between the FGLS and SOLS residuals is that the FGLS estimator
is inserted in place of the SOLS estimator; in particular, the FGLS residuals are
not from the transformed equation (7.31).) Using standard arguments, a consistent
estimator of $B$ is


$$\hat{B} \equiv N^{-1} \sum_{i=1}^{N} X_i' \hat{\Omega}^{-1} \hat{u}_i \hat{u}_i' \hat{\Omega}^{-1} X_i.$$
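The estimator of $\operatorname{Avar}(\hat{\beta})$ can be written as


$$\hat{A}^{-1} \hat{B} \hat{A}^{-1} / N = \left( \sum_{i=1}^{N} X_i' \hat{\Omega}^{-1} X_i \right)^{-1} \left( \sum_{i=1}^{N} X_i' \hat{\Omega}^{-1} \hat{u}_i \hat{u}_i' \hat{\Omega}^{-1} X_i \right) \left( \sum_{i=1}^{N} X_i' \hat{\Omega}^{-1} X_i \right)^{-1}. \tag{7.52}$$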


This is the extension of the White (1980b) heteroskedasticity-robust asymptotic vari- 
ance estimator to the case of systems of equations; see also White (2001). This esti- 
mator is valid under Assumptions SGLS.1 and SGLS.2; that is, it is completely 
robust. 
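
The sandwich form in (7.52) translates directly into array operations. The following is a hedged sketch on simulated heteroskedastic data; `fgls_robust_avar`, the array layout, and the data-generating process are illustrative assumptions rather than anything prescribed by the text.

```python
import numpy as np

def fgls_robust_avar(X, y, beta_hat, omega_hat):
    """Fully robust estimator of Avar(beta_hat), equation (7.52).

    X has shape (N, G, K); y has shape (N, G).
    """
    W = np.linalg.inv(omega_hat)
    u_hat = y - X @ beta_hat                       # FGLS residuals, equation (7.51)
    XW = np.einsum("igk,gh->ikh", X, W)            # X_i' W for each i, shape (N, K, G)
    bread = np.einsum("ikh,ihl->kl", XW, X)        # sum_i X_i' W X_i
    scores = np.einsum("ikh,ih->ik", XW, u_hat)    # X_i' W u_hat_i, shape (N, K)
    meat = scores.T @ scores                       # sum_i X_i' W u u' W X_i
    bread_inv = np.linalg.inv(bread)
    return bread_inv @ meat @ bread_inv

# Simulated system whose error variance depends on the regressors.
rng = np.random.default_rng(1)
N, G, K = 400, 2, 3
X = rng.normal(size=(N, G, K))
het = 1.0 + X[:, :, 0] ** 2
y = X @ np.array([1.0, -0.5, 0.25]) + np.sqrt(het) * rng.normal(size=(N, G))

# First stage: SOLS residuals give Omega-hat. Second stage: FGLS.
b0 = np.linalg.solve(np.einsum("igk,igl->kl", X, X), np.einsum("igk,ig->k", X, y))
r0 = y - X @ b0
omega_hat = r0.T @ r0 / N
W = np.linalg.inv(omega_hat)
beta_hat = np.linalg.solve(np.einsum("igk,gh,ihl->kl", X, W, X),
                           np.einsum("igk,gh,ih->k", X, W, y))

V_robust = fgls_robust_avar(X, y, beta_hat, omega_hat)
print(np.sqrt(np.diag(V_robust)))   # robust standard errors
```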

Incidentally, if system OLS and feasible GLS have both been used to estimate
$\beta$, and both estimators are presumed to be consistent, then it is legitimate to
compare the fully robust variance matrices, (7.28) and (7.52), respectively, to determine
whether FGLS appears to be asymptotically more efficient—keeping in mind that 
these are just estimates based on one random sample. The nonrobust OLS variance 
matrix generally should not be considered because it is valid only under very restric- 
tive assumptions. 


7.5.2 Asymptotic Variance of Feasible Generalized Least Squares under a Standard 
Assumption 


Under the assumptions so far, FGLS really has nothing to offer over SOLS. In ad- 
dition to being computationally more difficult, FGLS is less robust than SOLS. So 
why is FGLS used? The answer is that, under an additional assumption, FGLS is 
asymptotically more efficient than SOLS (and other estimators). First, we state the 
weakest condition that simplifies estimation of the asymptotic variance for FGLS. 
For reasons to be seen shortly, we call this a system homoskedasticity assumption. 


ASSUMPTION SGLS.3:  $E(X_i' \Omega^{-1} u_i u_i' \Omega^{-1} X_i) = E(X_i' \Omega^{-1} X_i)$, where $\Omega \equiv E(u_i u_i')$.


Another way to state this assumption is $B = A$, which, from expression (7.49),
simplifies the asymptotic variance. As stated, Assumption SGLS.3 is somewhat difficult
to interpret. When $G = 1$, it reduces to Assumption OLS.3. When $\Omega$ is diagonal and
$X_i$ has either the SUR or panel data structure, Assumption SGLS.3 implies a kind of
conditional homoskedasticity in each equation (or time period). Generally,
Assumption SGLS.3 puts restrictions on the conditional variances and covariances of
elements of $u_i$. A sufficient (though certainly not necessary) condition for Assumption
SGLS.3 is easier to interpret:


$$E(u_i u_i' \mid X_i) = E(u_i u_i'). \tag{7.53}$$


If $E(u_i \mid X_i) = 0$, then assumption (7.53) is the same as assuming $\operatorname{Var}(u_i \mid X_i) =
\operatorname{Var}(u_i) = \Omega$, which means that each variance and each covariance of elements
involving $u_i$ must be constant conditional on all of $X_i$. This is a natural way of stating
a system homoskedasticity assumption, but it is sometimes too strong.

When $G = 2$, $\Omega$ contains three distinct elements, $\sigma_1^2 = E(u_{i1}^2)$, $\sigma_2^2 = E(u_{i2}^2)$, and
$\sigma_{12} = E(u_{i1} u_{i2})$. These elements are not restricted by the assumptions we have made.
(The inequality $|\sigma_{12}| < \sigma_1 \sigma_2$ must always hold for $\Omega$ to be a nonsingular covariance
matrix.) But assumption (7.53) requires $E(u_{i1}^2 \mid X_i) = \sigma_1^2$, $E(u_{i2}^2 \mid X_i) = \sigma_2^2$, and
$E(u_{i1} u_{i2} \mid X_i) = \sigma_{12}$: the conditional variances and covariance must not depend on $X_i$.

That assumption (7.53) implies Assumption SGLS.3 is a consequence of iterated
expectations:


$$\begin{aligned}
E(X_i' \Omega^{-1} u_i u_i' \Omega^{-1} X_i) &= E[E(X_i' \Omega^{-1} u_i u_i' \Omega^{-1} X_i \mid X_i)] \\
&= E[X_i' \Omega^{-1} E(u_i u_i' \mid X_i) \Omega^{-1} X_i] = E(X_i' \Omega^{-1} \Omega \Omega^{-1} X_i) \\
&= E(X_i' \Omega^{-1} X_i).
\end{aligned}$$


While assumption (7.53) is easier to interpret, we use Assumption SGLS.3 for stating
the next theorem because there are cases, including some dynamic panel data models,
where Assumption SGLS.3 holds but assumption (7.53) does not.


THEOREM 7.4 (Usual Variance Matrix for FGLS): Under Assumptions SGLS.1-SGLS.3,
the asymptotic variance of the FGLS estimator is $\operatorname{Avar}(\hat{\beta}) = A^{-1}/N \equiv
[E(X_i' \Omega^{-1} X_i)]^{-1}/N$.


We obtain an estimator of $\operatorname{Avar}(\hat{\beta})$ by using our consistent estimator of $A$:


$$\widehat{\operatorname{Avar}}(\hat{\beta}) = \hat{A}^{-1}/N = \left( \sum_{i=1}^{N} X_i' \hat{\Omega}^{-1} X_i \right)^{-1}. \tag{7.54}$$


Equation (7.54) is the "usual" formula for the asymptotic variance of FGLS. It is
nonrobust in the sense that it relies on Assumption SGLS.3 in addition to
Assumptions SGLS.1 and SGLS.2. If system heteroskedasticity in $u_i$ is suspected, then the
robust estimator (7.52) should be used.

Assumption (7.53) has important efficiency implications. One consequence of 
Problem 7.2 is that, under Assumptions SGLS.1, SOLS.2, SGLS.2, and (7.53), the 
FGLS estimator is more efficient than the system OLS estimator. We can actually say
much more: FGLS is more efficient than any other estimator that uses the orthogonality
conditions $E(X_i \otimes u_i) = 0$. This conclusion will follow as a special case of
Theorem 8.4 in Chapter 8, where we define the class of competing estimators. If 
we replace Assumption SGLS.1 with the zero conditional mean assumption (7.15), 
then an even stronger efficiency result holds for FGLS, something we treat in 
Section 8.6. 


7.5.3 Properties of Feasible Generalized Least Squares with (Possibly Incorrect) 
Restrictions on the Unconditional Variance Matrix 


Sometimes we might wish to impose restrictions in estimating $\Omega$, possibly because
$G$ (in the panel data case, $T$) is large relative to $N$ (which means an unrestricted
estimator of $\Omega$ might result in poor finite sample properties of FGLS), or because a
particular structure for $\Omega$ suggests itself. For example, in a panel data context with
strictly exogenous regressors, an AR(1) model of serial correlation, with variances
constant over time, might seem plausible, and greatly conserves on parameters when
$T$ is even moderately large. The element $(t, s)$ in $\Omega$ is of the form $\sigma^2 \rho^{|t-s|}$ for $|\rho| < 1$;
we describe this further in Section 7.8.6. In Chapter 10, we discuss an important
panel data model with constant variances over time where the pairwise correlations
across any two different time periods are constant.
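
To make the parameter savings concrete, the AR(1) structure just described can be built from two numbers. The helper below is a small sketch; `ar1_omega` and the chosen values of $\sigma^2$, $\rho$, and $T$ are hypothetical names and inputs used only for illustration.

```python
import numpy as np

def ar1_omega(sigma2, rho, T):
    """T x T variance matrix whose (t, s) element is sigma2 * rho**|t - s|."""
    if not abs(rho) < 1:
        raise ValueError("the AR(1) structure requires |rho| < 1")
    t = np.arange(T)
    return sigma2 * rho ** np.abs(t[:, None] - t[None, :])

omega = ar1_omega(sigma2=2.0, rho=0.5, T=4)
print(omega)

# An unrestricted Omega has T(T + 1)/2 distinct elements; AR(1) uses only 2.
T = 4
print(T * (T + 1) // 2, "free elements unrestricted versus 2 under AR(1)")
```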

It is important to know that under Assumption SGLS.1, feasible GLS with an
incorrect structure imposed on $\hat{\Omega}$ is generally consistent and $\sqrt{N}$-asymptotically
normal. To see why, let $\hat{\Lambda}$ denote an estimator that may be inconsistent for $\Omega$.
Nevertheless, $\hat{\Lambda}$ usually has a well-defined, nonsingular probability limit: $\Lambda \equiv \operatorname{plim} \hat{\Lambda}$.
Then, the FGLS estimator of $\beta$ using $\hat{\Lambda}$ as the variance matrix estimator is consistent
if

$$E(X_i' \Lambda^{-1} u_i) = 0 \tag{7.55}$$


(along with the modification of the rank condition SGLS.2 that inserts $\Lambda$ in place of
$\Omega$). Condition (7.55) always holds if Assumption SGLS.1 holds. Therefore,
exogeneity of each element of $X_i$ in each equation (time period) ensures that using an
inconsistent estimator of $\Omega$ does not result in inconsistency of FGLS.

The $\sqrt{N}$-asymptotic equivalence between the estimators that use $\hat{\Lambda}$ and $\Lambda$
continues to hold under Assumption SGLS.1, and so we can conduct asymptotic
inference ignoring the first stage estimation of $\Lambda$. Nevertheless, the analogue of Assumption
SGLS.3, $E(X_i' \Lambda^{-1} u_i u_i' \Lambda^{-1} X_i) = E(X_i' \Lambda^{-1} X_i)$, generally fails, even under the system
homoskedasticity assumption (7.53). (In fact, it is easy to show $E(X_i' \Lambda^{-1} u_i u_i' \Lambda^{-1} X_i)
= E(X_i' \Lambda^{-1} \Omega \Lambda^{-1} X_i)$ when (7.53) holds.) Therefore, if we entertain the possibility that
the restrictions imposed in obtaining $\hat{\Lambda}$ are incorrect (as we probably should), the
fully robust variance matrix in (7.52) should be used (with the slight notational
change of replacing $\hat{\Omega}$ with $\hat{\Lambda}$).

We will use the findings in this section in Chapter 10, and we will extend the points 
made here to estimation of systems of nonlinear equations in Chapter 12. Problem 
7.14 asks you to show that using a consistent estimator of $\Omega$ always leads to an
estimator at least as efficient as using an inconsistent estimator, provided the appropriate
system homoskedasticity assumption holds. 

Before leaving this section, it is useful to point out a case where one should use a
restricted estimator of $\Omega$ even if $E(u_i u_i')$ does not satisfy the restrictions. A leading
case is a panel data model where the regressors are contemporaneously exogenous
but not strictly exogenous, so that Assumption SOLS.1 holds but Assumption
SGLS.1 does not. In this case, using an unrestricted estimator of $\Omega$ generally results
in inconsistency of FGLS if $\Omega$ is not diagonal. Of course, we can always apply pooled
OLS in such cases, but POLS ignores the different error variances in the different
time periods. We might want to exploit those different variances in estimation whether
or not we think $\Omega$ is diagonal.

If $\hat{\Lambda}$ is a diagonal matrix containing consistent estimators of the error
variances down the diagonal, then $\hat{\Lambda} \overset{p}{\to} \Lambda$, where $\Lambda = \operatorname{diag}(\sigma_1^2, \ldots, \sigma_T^2)$. It is easily seen
that (7.55) holds under $E(x_{it}' u_{it}) = 0$, $t = 1, \ldots, T$, regardless of the actual structure of
$\Omega$. In other words, contemporaneous exogeneity is sufficient. As shown in Problem
7.7, the resulting FGLS estimator can be computed from weighted least squares
where the weights for different time periods are the inverses of the estimated
variances.
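
The weighted least squares computation mentioned above can be sketched as follows, assuming a balanced panel stored as arrays of shape (N, T, K) and (N, T). The function name, the use of pooled OLS residuals to estimate each period's variance, and the simulated design are illustrative assumptions; Problem 7.7 gives the formal statement.

```python
import numpy as np

def pooled_wls_by_period(X, y):
    """FGLS with a diagonal Lambda-hat, computed as weighted least squares."""
    N, T, K = X.shape
    # Step 1: pooled OLS and its residuals.
    b_pols = np.linalg.lstsq(X.reshape(N * T, K), y.reshape(N * T), rcond=None)[0]
    resid = y - X @ b_pols
    # Step 2: one estimated error variance per time period.
    sigma2_t = (resid ** 2).mean(axis=0)
    # Step 3: divide each period's data by its estimated standard deviation, then pool.
    sd_t = np.sqrt(sigma2_t)
    Xw = X / sd_t[None, :, None]
    yw = y / sd_t[None, :]
    b_wls = np.linalg.lstsq(Xw.reshape(N * T, K), yw.reshape(N * T), rcond=None)[0]
    return b_wls, sigma2_t

# Simulated panel with error variances 1, 4, and 9 in the three periods.
rng = np.random.default_rng(2)
N, T, K = 500, 3, 2
X = np.concatenate([np.ones((N, T, 1)), rng.normal(size=(N, T, 1))], axis=2)
true_sd = np.array([1.0, 2.0, 3.0])
y = X @ np.array([0.5, 1.5]) + true_sd * rng.normal(size=(N, T))

b_wls, sigma2_t = pooled_wls_by_period(X, y)
print(b_wls)
print(sigma2_t)
```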

Under contemporaneous exogeneity, the FGLS estimator based on diagonal $\hat{\Lambda}$ can
be shown to be $\sqrt{N}$-asymptotically equivalent to the (infeasible) estimator that uses
$\Lambda$, and so asymptotic inference is still straightforward. Of course, without more
assumptions, the FGLS estimator that uses $\hat{\Lambda}$ is not necessarily more efficient than
the pooled OLS estimator. Under some additional assumptions given in Problem 7.7
that imply $\Omega$ is diagonal, and therefore is the same as $\Lambda$, the FGLS estimator that
uses $\hat{\Lambda}$ can be shown to be asymptotically more efficient than the POLS estimator.


7.6 Testing the Use of Feasible Generalized Least Squares 


Asymptotic standard errors are obtained in the usual fashion from the asymptotic 
variance estimates. We can use the nonrobust version in equation (7.54) or, even 
better, the robust version in equation (7.52), to construct $t$ statistics and confidence
intervals. Testing multiple restrictions is fairly easy using the Wald test, which always
has the same general form. The important consideration lies in choosing the
asymptotic variance estimate, $\hat{V}$. Standard Wald statistics use equation (7.54), and this
approach produces limiting chi-square statistics under the homoskedasticity
assumption SGLS.3. Completely robust Wald statistics are obtained by choosing $\hat{V}$ as in
equation (7.52).

If Assumption SGLS.3 holds under $H_0$, we can define a statistic based on the
weighted sums of squared residuals. To obtain the statistic, we estimate the model
with and without the restrictions imposed on $\beta$, where the same estimator of $\Omega$,
usually based on the unrestricted SOLS residuals, is used in obtaining the restricted and
unrestricted FGLS estimators. Let $\tilde{u}_i$ denote the residuals from constrained FGLS
(with $Q$ restrictions imposed on $\tilde{\beta}$) using variance matrix $\hat{\Omega}$. It can be shown that,
under $H_0$ and Assumptions SGLS.1-SGLS.3,


$$\left( \sum_{i=1}^{N} \tilde{u}_i' \hat{\Omega}^{-1} \tilde{u}_i - \sum_{i=1}^{N} \hat{u}_i' \hat{\Omega}^{-1} \hat{u}_i \right) \overset{a}{\sim} \chi_Q^2. \tag{7.56}$$


Gallant (1987) shows expression (7.56) for nonlinear models with fixed regressors; 
essentially the same proof works here under Assumptions SGLS.1-SGLS.3, as we 
will show more generally in Chapter 12. 

The statistic in expression (7.56) is the difference between the transformed sum 
of squared residuals from the restricted and unrestricted models, but it is just as easy 
to calculate expression (7.56) directly. Gallant (1987, Chap. 5) has found that an 
F statistic has better finite sample properties. The F statistic in this context is 
defined as 


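$$F = \left[ \left( \sum_{i=1}^{N} \tilde{u}_i' \hat{\Omega}^{-1} \tilde{u}_i - \sum_{i=1}^{N} \hat{u}_i' \hat{\Omega}^{-1} \hat{u}_i \right) \Big/ \left( \sum_{i=1}^{N} \hat{u}_i' \hat{\Omega}^{-1} \hat{u}_i \right) \right] \cdot \left[ (NG - K)/Q \right]. \tag{7.57}$$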


Why can we treat this equation as having an approximate $F$ distribution? First,
for $NG - K$ large, $F_{Q, NG-K} \overset{a}{\sim} \chi_Q^2 / Q$. Therefore, dividing expression (7.56) by $Q$
gives us an approximate $F_{Q, NG-K}$ distribution. The presence of the other two
terms in equation (7.57) is to improve the $F$-approximation. Since $E(u_i' \Omega^{-1} u_i) =
\operatorname{tr}\{E(\Omega^{-1} u_i u_i')\} = \operatorname{tr}\{E(\Omega^{-1} \Omega)\} = G$, it follows that
$(NG)^{-1} \sum_{i=1}^{N} u_i' \Omega^{-1} u_i \overset{p}{\to} 1$; replacing $u_i' \Omega^{-1} u_i$ with $\hat{u}_i' \hat{\Omega}^{-1} \hat{u}_i$ does not
affect this consistency result. Subtracting off
$K$ as a degrees-of-freedom adjustment changes nothing asymptotically, and so
$(NG - K)^{-1} \sum_{i=1}^{N} \hat{u}_i' \hat{\Omega}^{-1} \hat{u}_i \overset{p}{\to} 1$. Multiplying expression (7.56) by the inverse of this
quantity does not affect its asymptotic distribution.
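
The statistics in (7.56) and (7.57) can be computed with a short routine. The sketch below handles only exclusion restrictions (dropping columns of $X_i$) and uses simulated data for which the null hypothesis is true; the function names and the data-generating process are assumptions for illustration only.

```python
import numpy as np

def gls(X, y, W):
    """GLS for a given weighting matrix W (the inverse of Omega-hat)."""
    XWX = np.einsum("igk,gh,ihl->kl", X, W, X)
    XWy = np.einsum("igk,gh,ih->k", X, W, y)
    return np.linalg.solve(XWX, XWy)

def weighted_ssr(u, W):
    """sum_i u_i' W u_i."""
    return float(np.einsum("ig,gh,ih->", u, W, u))

def fgls_exclusion_test(X, y, drop):
    """Test that the coefficients on the columns listed in `drop` are zero."""
    N, G, K = X.shape
    Q = len(drop)
    keep = [k for k in range(K) if k not in drop]
    r = y - X @ gls(X, y, np.eye(G))          # unrestricted SOLS residuals
    W = np.linalg.inv(r.T @ r / N)            # same Omega-hat for both fits
    u_hat = y - X @ gls(X, y, W)              # unrestricted FGLS residuals
    Xr = X[:, :, keep]
    u_tilde = y - Xr @ gls(Xr, y, W)          # restricted FGLS residuals
    ssr_u, ssr_r = weighted_ssr(u_hat, W), weighted_ssr(u_tilde, W)
    chi2 = ssr_r - ssr_u                      # expression (7.56)
    F = (chi2 / ssr_u) * (N * G - K) / Q      # equation (7.57)
    return chi2, F, ssr_u

rng = np.random.default_rng(5)
N, G, K = 500, 2, 6
X = np.zeros((N, G, K))
X[:, 0, 0], X[:, 1, 3] = 1.0, 1.0
X[:, 0, 1:3] = rng.normal(size=(N, 2))        # equation 1: x1, z1
X[:, 1, 4:6] = rng.normal(size=(N, 2))        # equation 2: x2, z2
beta = np.array([1.0, 0.5, 0.0, -1.0, 2.0, 0.0])   # z1 and z2 have zero coefficients
u = rng.multivariate_normal(np.zeros(G), [[1.0, 0.5], [0.5, 1.0]], size=N)
y = X @ beta + u

chi2, F, ssr_u = fgls_exclusion_test(X, y, drop=[2, 5])
print(chi2, F, ssr_u / (N * G))
```

The last printed ratio should be close to one, which is the consistency result used above to justify the scaling in (7.57).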
7.7 Seemingly Unrelated Regressions, Revisited


We now return to the SUR system in assumption (7.2). We saw in Section 7.3 how to 
write this system in the form (7.11) if there are no cross equation restrictions on the 
$\beta_g$. We also showed that the system OLS estimator corresponds to estimating each
equation separately by OLS. 

As mentioned earlier, in most applications of SUR it is reasonable to assume that 
$E(x_{ig}' u_{ih}) = 0$, $g, h = 1, 2, \ldots, G$, which is just Assumption SGLS.1 for the SUR
structure. Under this assumption, FGLS will consistently estimate the $\beta_g$.

OLS equation by equation is simple to use and leads to standard inference for each
$\beta_g$ under the OLS homoskedasticity assumption $E(u_{ig}^2 \mid x_{ig}) = \sigma_g^2$, which is standard
in SUR contexts. So why bother using FGLS in such applications? There are two 
answers. First, as mentioned in Section 7.5.2, if we can maintain assumption (7.53) in 
addition to Assumption SGLS.1 (and SGLS.2), FGLS is asymptotically at least as 
efficient as system OLS. Second, while OLS equation by equation allows us to easily 
test hypotheses about the coefficients within an equation, it does not provide a con- 
venient way for testing cross equation restrictions. It is possible to use OLS for testing 
cross equation restrictions by using the variance matrix (7.26), but if we are willing to 
go through that much trouble, we should just use FGLS. 


7.7.1 Comparison between Ordinary Least Squares and Feasible Generalized Least 
Squares for Seemingly Unrelated Regressions Systems 


There are two cases where OLS equation by equation is algebraically equivalent to 
FGLS. The first case is fairly straightforward to analyze in our setting. 


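THEOREM 7.5 (Equivalence of FGLS and OLS, I): If $\hat{\Omega}$ is a diagonal matrix, then
OLS equation by equation is identical to FGLS.


Proof: If $\hat{\Omega}$ is diagonal, then $\hat{\Omega}^{-1} = \operatorname{diag}(\hat{\sigma}_1^{-2}, \ldots, \hat{\sigma}_G^{-2})$. With $X_i$ defined as in the
matrix (7.10), straightforward algebra shows that


$$X_i' \hat{\Omega}^{-1} X_i = \hat{C}^{-1} X_i' X_i \quad \text{and} \quad X_i' \hat{\Omega}^{-1} y_i = \hat{C}^{-1} X_i' y_i,$$


where $\hat{C}$ is the block diagonal matrix with $\hat{\sigma}_g^2 I_{K_g}$ as its $g$th block. It follows that the
FGLS estimator can be written as


$$\hat{\beta} = \left( \sum_{i=1}^{N} \hat{C}^{-1} X_i' X_i \right)^{-1} \left( \sum_{i=1}^{N} \hat{C}^{-1} X_i' y_i \right) = \left( \sum_{i=1}^{N} X_i' X_i \right)^{-1} \left( \sum_{i=1}^{N} X_i' y_i \right),$$


which is the system OLS estimator.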
In applications, $\hat{\Omega}$ would not be diagonal unless we impose a diagonal structure.
Nevertheless, we can use Theorem 7.5 to obtain an asymptotic equivalence result
when $\Omega$ is diagonal. If $\Omega$ is diagonal, then GLS and OLS are algebraically
identical (because GLS uses $\Omega$). We know that FGLS and GLS are $\sqrt{N}$-asymptotically
equivalent for any $\Omega$. Therefore, OLS and FGLS are $\sqrt{N}$-asymptotically equivalent
if $\Omega$ is diagonal, even though they are not algebraically equivalent (because $\hat{\Omega}$ is not
diagonal).

The second algebraic equivalence result holds without any restrictions on $\hat{\Omega}$. It is
special in that it assumes that the same regressors appear in each equation.


THEOREM 7.6 (Equivalence of FGLS and OLS, II): If $x_{i1} = x_{i2} = \cdots = x_{iG}$ for all $i$,
that is, if the same regressors show up in each equation (for all observations), then
OLS equation by equation and FGLS are identical.


In practice, Theorem 7.6 holds when the population model has the same explanatory 
variables in each equation. The usual proof of this result groups all N observations 
for the first equation followed by the N observations for the second equation, and 
so on (see, for example, Greene (1997, Chap. 17)). Problem 7.5 asks you to prove 
Theorem 7.6 in the current setup, where we have ordered the observations to be 
amenable to asymptotic analysis. 
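
Theorem 7.6 is easy to check numerically. The sketch below simulates a three-equation system with identical regressors and correlated errors, then compares OLS equation by equation with FGLS based on an unrestricted $\hat{\Omega}$; the dimensions, seed, and error structure are arbitrary choices made for illustration, and a numerical check of this kind is of course not a proof.

```python
import numpy as np

rng = np.random.default_rng(3)
N, G, k = 200, 3, 4
x = np.column_stack([np.ones(N), rng.normal(size=(N, k - 1))])    # x_i, same in every equation
X = np.einsum("gh,ik->ighk", np.eye(G), x).reshape(N, G, G * k)   # X_i = I_G kron x_i
beta = rng.normal(size=G * k)
L = np.array([[1.0, 0.0, 0.0], [0.5, 1.0, 0.0], [-0.3, 0.4, 1.0]])
y = X @ beta + rng.normal(size=(N, G)) @ L.T                      # errors correlated across equations

# OLS equation by equation, stacked as (beta_1', ..., beta_G')'.
b_ols = np.concatenate([np.linalg.lstsq(x, y[:, g], rcond=None)[0] for g in range(G)])

# FGLS with the unrestricted Omega-hat built from the OLS residuals.
r = y - X @ b_ols
W = np.linalg.inv(r.T @ r / N)
b_fgls = np.linalg.solve(np.einsum("igk,gh,ihl->kl", X, W, X),
                         np.einsum("igk,gh,ih->k", X, W, y))

print(np.max(np.abs(b_fgls - b_ols)))   # zero up to rounding error
```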

It is important to know that when every equation contains the same regressors in an 
SUR system, there is still a good reason to use a SUR software routine in obtaining 
the estimates: we may be interested in testing joint hypotheses involving parameters 
in different equations. In order to do so we need to estimate the variance matrix of $\hat{\beta}$
(not just the variance matrix of each $\hat{\beta}_g$, which only allows tests of the coefficients
within an equation). Estimating each equation by OLS does not directly yield the 
covariances between the estimators from different equations. Any SUR routine will 
perform this operation automatically, then compute F statistics as in equation (7.57) 
(or the chi-square alternative, the Wald statistic). 


Example 7.3 (SUR System for Wages and Fringe Benefits): We use the data on 
wages and fringe benefits in FRINGE.RAW to estimate a two-equation system for 
hourly wage and hourly benefits. There are 616 workers in the data set. The FGLS 
estimates are given in Table 7.1, with asymptotic standard errors in parentheses 
below estimated coefficients. 

The estimated coefficients generally have the signs we expect. Other things equal, 
people with more education have higher hourly wage and benefits, males have higher 
predicted wages and benefits ($1.79 and 27 cents higher, respectively), and people 
with more tenure have higher earnings and benefits, although the effect is diminishing 
in both cases. (The turning point for hrearn is at about 10.8 years, while for hrbens it
is 22.5 years.) The coefficients on experience are interesting. Experience is estimated
to have a diminishing effect for benefits but an increasing effect for earnings, although
the estimated upturn for earnings is not until 9.5 years.


Table 7.1
An Estimated SUR Model for Hourly Wages and Hourly Benefits

Explanatory Variables      hrearn       hrbens
educ                         .459         .077
                            (.069)       (.008)
exper                       -.076         .023
                            (.057)       (.007)
exper^2                      .0040       -.0005
                            (.0012)      (.0001)
tenure                       .110         .054
                            (.084)       (.010)
tenure^2                    -.0051       -.0012
                            (.0033)      (.0004)
union                        .808         .366
                            (.408)       (.049)
south                       -.457        -.023
                            (.552)       (.066)
nrtheast                   -1.151        -.057
                           (0.606)       (.072)
nrthcen                     -.636        -.038
                            (.556)       (.066)
married                      .642         .058
                            (.418)       (.050)
white                       1.141         .090
                           (0.612)       (.073)
male                        1.785         .268
                           (0.398)       (.048)
intercept                  -2.632        -.890
                           (1.228)       (.147)
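
The turning points quoted in the discussion of Table 7.1 follow from setting the derivative of a quadratic to zero: for $b_1 x + b_2 x^2$ the turning point is at $x = -b_1/(2 b_2)$. A short check, using the coefficients as read from Table 7.1 (the helper name is ours):

```python
def turning_point(b_linear, b_squared):
    """Value of x at which the slope of b_linear*x + b_squared*x**2 is zero."""
    return -b_linear / (2.0 * b_squared)

tenure_hrearn = turning_point(0.110, -0.0051)   # tenure in the hrearn equation
tenure_hrbens = turning_point(0.054, -0.0012)   # tenure in the hrbens equation
exper_hrearn = turning_point(-0.076, 0.0040)    # exper in the hrearn equation

print(round(tenure_hrearn, 1), round(tenure_hrbens, 1), round(exper_hrearn, 1))
```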

Belonging to a union implies higher wages and benefits, with the benefits coefficient 
being especially statistically significant ($t \approx 7.5$).

The errors across the two equations appear to be positively correlated, with an 
estimated correlation of about .32. This result is not surprising: the same unobserv- 
ables, such as ability, that lead to higher earnings, also lead to higher benefits. 

Clearly, there are significant differences between males and females in both earn- 
ings and benefits. But what about between whites and nonwhites, and married and 
unmarried people? The F-type statistic for joint significance of married and white in 
both equations is F = 1.83. We are testing four restrictions (Q = 4), N = 616, G = 2, 
and K = 2(13) = 26, so the degrees of freedom in the F distribution are 4 and 1,206.
The p-value is about .121, so these variables are jointly insignificant at the 10 per- 
cent level. 


If the regressors are different in different equations, $\Omega$ is not diagonal, and the
conditions in Section 7.5.2 hold, then FGLS is generally asymptotically more efficient
than OLS equation by equation. One thing to remember is that the efficiency of
FGLS comes at the price of assuming that the regressors in each equation are
uncorrelated with the errors in each equation. For SOLS and FGLS to be different,
the $x_g$ must vary across $g$. If $x_g$ varies across $g$, certain explanatory variables have
been intentionally omitted from some equations. If we are interested in, say, the first
equation, but we make a mistake in specifying the second equation, FGLS will
generally produce inconsistent estimators of the parameters in all equations. However,
OLS estimation of the first equation is consistent if $E(x_1' u_1) = 0$.

The previous discussion reflects the trade-off between efficiency and robustness that 
we often encounter in estimation problems. 


7.7.2 Systems with Cross Equation Restrictions 


So far we have studied SUR under the assumption that the $\beta_g$ are unrelated across
equations. When systems of equations are used in economics, especially for modeling 
consumer and producer theory, there are often cross equation restrictions on the 
parameters. Such models can still be written in the general form we have covered, 
and so they can be estimated by system OLS and FGLS. We still refer to such sys- 
tems as SUR systems, even though the equations are now obviously related, and 
system OLS is no longer OLS equation by equation. 


Example 7.4 (SUR with Cross Equation Restrictions): Consider the two-equation
population model


$$y_1 = \gamma_{10} + \gamma_{11} x_{11} + \gamma_{12} x_{12} + \alpha_1 x_{13} + \alpha_2 x_{14} + u_1 \tag{7.58}$$


$$y_2 = \gamma_{20} + \gamma_{21} x_{21} + \alpha_1 x_{22} + \alpha_2 x_{23} + \gamma_{24} x_{24} + u_2 \tag{7.59}$$


where we have imposed cross equation restrictions on the parameters in the two
equations because $\alpha_1$ and $\alpha_2$ show up in each equation. We can put this model into
the form of equation (7.11) by appropriately defining $X_i$ and $\beta$. For example, define
$\beta = (\gamma_{10}, \gamma_{11}, \gamma_{12}, \alpha_1, \alpha_2, \gamma_{20}, \gamma_{21}, \gamma_{24})'$, which we know must be an $8 \times 1$ vector because
there are eight parameters in this system. The order in which these elements appear in
$\beta$ is up to us, but once $\beta$ is defined, $X_i$ must be chosen accordingly. For each
observation $i$, define the $2 \times 8$ matrix

$$X_i = \begin{pmatrix} 1 & x_{i11} & x_{i12} & x_{i13} & x_{i14} & 0 & 0 & 0 \\ 0 & 0 & 0 & x_{i22} & x_{i23} & 1 & x_{i21} & x_{i24} \end{pmatrix}.$$


Multiplying $X_i$ by $\beta$ gives the equations (7.58) and (7.59).
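
As a quick check of this bookkeeping, the following sketch builds $X_i$ for one observation and confirms that $X_i \beta$ reproduces the right-hand sides of (7.58) and (7.59) without the errors. The helper name `build_Xi` and the numerical values are hypothetical.

```python
import numpy as np

def build_Xi(x1, x2):
    """2 x 8 regressor matrix for one observation.

    x1 = (x11, x12, x13, x14) are the regressors in equation (7.58);
    x2 = (x21, x22, x23, x24) are the regressors in equation (7.59).
    """
    x11, x12, x13, x14 = x1
    x21, x22, x23, x24 = x2
    return np.array([
        [1.0, x11, x12, x13, x14, 0.0, 0.0, 0.0],
        [0.0, 0.0, 0.0, x22, x23, 1.0, x21, x24],
    ])

# beta ordered as (g10, g11, g12, a1, a2, g20, g21, g24).
g10, g11, g12, a1, a2, g20, g21, g24 = 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0
beta = np.array([g10, g11, g12, a1, a2, g20, g21, g24])

x1 = (0.5, -1.0, 2.0, 0.3)
x2 = (1.5, 0.2, -0.7, 1.1)
Xi = build_Xi(x1, x2)
print(Xi @ beta)
```

Because $\alpha_1$ and $\alpha_2$ occupy the same columns in both rows, any estimator based on this $X_i$ imposes the cross equation restrictions automatically.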


In applications such as the previous example, it is fairly straightforward to test the 
cross equation restrictions, especially using the sum of squared residuals statistics 
(equation (7.56) or (7.57)). The unrestricted model simply allows each explanatory 
variable in each equation to have its own coefficient. We would use the unrestricted 
estimates to obtain $\hat{\Omega}$, and then obtain the restricted estimates using $\hat{\Omega}$.


7.7.3 Singular Variance Matrices in Seemingly Unrelated Regressions Systems 


In our treatment so far we have assumed that the variance matrix $\Omega$ of $u_i$ is
nonsingular. In consumer and producer theory applications this assumption is not always
true in the original structural equations, because of additivity constraints. 


Example 7.5 (Cost Share Equations): Suppose that, for a given year, each firm in
a particular industry uses three inputs, capital ($K$), labor ($L$), and materials ($M$).
Because of regional variation and differential tax concessions, firms across the United
States face possibly different prices for these inputs: let $p_{iK}$ denote the price of capital
to firm $i$, $p_{iL}$ be the price of labor for firm $i$, and $p_{iM}$ denote the price of materials for
firm $i$. For each firm $i$, let $s_{iK}$ be the cost share for capital, let $s_{iL}$ be the cost share for
labor, and let $s_{iM}$ be the cost share for materials. By definition, $s_{iK} + s_{iL} + s_{iM} = 1$.
One popular set of cost share equations is


$$s_{iK} = \gamma_{10} + \gamma_{11} \log(p_{iK}) + \gamma_{12} \log(p_{iL}) + \gamma_{13} \log(p_{iM}) + u_{iK} \tag{7.60}$$

$$s_{iL} = \gamma_{20} + \gamma_{12} \log(p_{iK}) + \gamma_{22} \log(p_{iL}) + \gamma_{23} \log(p_{iM}) + u_{iL} \tag{7.61}$$

$$s_{iM} = \gamma_{30} + \gamma_{13} \log(p_{iK}) + \gamma_{23} \log(p_{iL}) + \gamma_{33} \log(p_{iM}) + u_{iM} \tag{7.62}$$


where the symmetry restrictions from production theory have been imposed. The
errors $u_{ig}$ can be viewed as unobservables affecting production that the economist
cannot observe. For an SUR analysis we would assume that


$$E(u_i \mid p_i) = 0, \tag{7.63}$$


where $u_i \equiv (u_{iK}, u_{iL}, u_{iM})'$ and $p_i \equiv (p_{iK}, p_{iL}, p_{iM})$. Because the cost shares must sum
to unity for each $i$, $\gamma_{10} + \gamma_{20} + \gamma_{30} = 1$, $\gamma_{11} + \gamma_{12} + \gamma_{13} = 0$, $\gamma_{12} + \gamma_{22} + \gamma_{23} = 0$,
$\gamma_{13} + \gamma_{23} + \gamma_{33} = 0$, and $u_{iK} + u_{iL} + u_{iM} = 0$. This last restriction implies that $\Omega \equiv \operatorname{Var}(u_i)$
has rank two. Therefore, we can drop one of the equations (say, the equation for
materials) and analyze the equations for labor and capital. We can express the
restrictions on the gammas in these first two equations as

$$\gamma_{13} = -\gamma_{11} - \gamma_{12} \tag{7.64}$$

$$\gamma_{23} = -\gamma_{12} - \gamma_{22}. \tag{7.65}$$

Using the fact that $\log(a/b) = \log(a) - \log(b)$, we can plug equations (7.64) and
(7.65) into equations (7.60) and (7.61) to get

$$s_{iK} = \gamma_{10} + \gamma_{11} \log(p_{iK}/p_{iM}) + \gamma_{12} \log(p_{iL}/p_{iM}) + u_{iK}$$

$$s_{iL} = \gamma_{20} + \gamma_{12} \log(p_{iK}/p_{iM}) + \gamma_{22} \log(p_{iL}/p_{iM}) + u_{iL}.$$


We now have a two-equation system with variance matrix of full rank, with unknown
parameters $\gamma_{10}$, $\gamma_{20}$, $\gamma_{11}$, $\gamma_{12}$, and $\gamma_{22}$. To write this in the form (7.11), redefine
$u_i = (u_{iK}, u_{iL})'$ and $y_i \equiv (s_{iK}, s_{iL})'$. Take $\beta \equiv (\gamma_{10}, \gamma_{11}, \gamma_{12}, \gamma_{20}, \gamma_{22})'$ and then $X_i$ must be


$$X_i \equiv \begin{pmatrix} 1 & \log(p_{iK}/p_{iM}) & \log(p_{iL}/p_{iM}) & 0 & 0 \\ 0 & 0 & \log(p_{iK}/p_{iM}) & 1 & \log(p_{iL}/p_{iM}) \end{pmatrix}. \tag{7.66}$$


This formulation imposes all the conditions implied by production theory.
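
Two points in this example lend themselves to a short numerical illustration: the adding-up constraint makes the three-equation error variance matrix singular, and the matrix in (7.66) places the shared parameter $\gamma_{12}$ in a single column. The sketch below uses simulated errors and made-up prices and parameter values; `build_Xi` is a hypothetical helper.

```python
import numpy as np

rng = np.random.default_rng(4)
N = 500

# Errors that sum to zero across the three share equations for every firm.
e = rng.normal(size=(N, 3))
u3 = e - e.mean(axis=1, keepdims=True)
omega3 = u3.T @ u3 / N
rank_three_eq = np.linalg.matrix_rank(omega3)          # singular: rank two
rank_two_eq = np.linalg.matrix_rank(omega3[:2, :2])    # full rank after dropping materials
print(rank_three_eq, rank_two_eq)

def build_Xi(pK, pL, pM):
    """2 x 5 regressor matrix in (7.66) for one firm."""
    lK, lL = np.log(pK / pM), np.log(pL / pM)
    return np.array([[1.0, lK, lL, 0.0, 0.0],
                     [0.0, 0.0, lK, 1.0, lL]])

# beta ordered as (g10, g11, g12, g20, g22); the values are made up.
g10, g11, g12, g20, g22 = 0.3, 0.05, -0.02, 0.4, 0.08
beta = np.array([g10, g11, g12, g20, g22])
pK, pL, pM = 1.2, 0.9, 1.5
Xi = build_Xi(pK, pL, pM)
print(Xi @ beta)
```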


This model could be extended in several ways. The simplest would be to allow the
intercepts to depend on firm characteristics. For each firm $i$, let $z_i$ be a $1 \times J$ vector of
observable firm characteristics, where $z_{i1} \equiv 1$. Then we can extend the model to


$$s_{iK} = z_i \delta_1 + \gamma_{11} \log(p_{iK}/p_{iM}) + \gamma_{12} \log(p_{iL}/p_{iM}) + u_{iK} \tag{7.67}$$

$$s_{iL} = z_i \delta_2 + \gamma_{12} \log(p_{iK}/p_{iM}) + \gamma_{22} \log(p_{iL}/p_{iM}) + u_{iL} \tag{7.68}$$

where

$$E(u_{ig} \mid z_i, p_{iK}, p_{iL}, p_{iM}) = 0, \quad g = K, L. \tag{7.69}$$


Because we have already reduced the system to two equations, theory implies no
restrictions on $\delta_1$ and $\delta_2$. As an exercise, you should write this system in the form
(7.11). For example, if $\beta \equiv (\delta_1', \gamma_{11}, \gamma_{12}, \delta_2', \gamma_{22})'$ is $(2J + 3) \times 1$, how should $X_i$ be
defined?

Under condition (7.69), system OLS and FGLS estimators are both consistent.
(In this setup system OLS is not OLS equation by equation because $\gamma_{12}$ shows up in
both equations.) FGLS is asymptotically efficient if $\operatorname{Var}(u_i \mid z_i, p_i)$ is constant. If
$\operatorname{Var}(u_i \mid z_i, p_i)$ depends on $(z_i, p_i)$ (see Brown and Walker (1995) for a discussion of
why we should expect it to), then we should at least use the robust variance matrix
estimator for FGLS. In Chapter 12 we will discuss multivariate weighted least
squares estimators that can be more efficient.
We can easily test the symmetry assumption imposed in equations (7.67) and
(7.68). One approach is to first estimate the system without any restrictions on the
parameters, in which case FGLS reduces to OLS estimation of each equation. Then,
compute the $t$ statistic of the difference in the estimates on $\log(p_{iL}/p_{iM})$ in equation
(7.67) and $\log(p_{iK}/p_{iM})$ in equation (7.68). Or, the $F$ statistic from equation (7.57)
can be used; $\hat{\Omega}$ would be obtained from the unrestricted OLS estimation of each
equation.

System OLS has no robustness advantages over FGLS in this setup because we 
cannot relax assumption (7.69) in any useful way. 


7.8 Linear Panel Data Model, Revisited 


We now study the linear panel data model in more detail. Having data over time for 
the same cross section units is useful for several reasons. For one, it allows us to look 
at dynamic relationships, something we cannot do with a single cross section. A panel 
data set also allows us to control for unobserved cross section heterogeneity, but we 
will not exploit this feature of panel data until Chapter 10. 


7.8.1 Assumptions for Pooled Ordinary Least Squares 


We now summarize the properties of pooled OLS and feasible GLS for the linear 
panel data model 


$$y_t = x_t \beta + u_t, \quad t = 1, 2, \ldots, T. \tag{7.70}$$


As always, when we need to indicate a particular cross section observation we include
an $i$ subscript, such as $y_{it}$.

This model may appear overly restrictive because $\beta$ is the same in each time period.
However, by appropriately choosing $x_{it}$, we can allow for parameters changing over
time. Also, even though we write $x_{it}$, some of the elements of $x_{it}$ may not be
time-varying, such as gender dummies when $i$ indexes individuals, or industry dummies
when $i$ indexes firms, or state dummies when $i$ indexes cities.


Example 7.6 (Wage Equation with Panel Data): Suppose we have data for the years
1990, 1991, and 1992 on a cross section of individuals, and we would like to estimate
the effect of computer usage on individual wages. One possible static model is


$$\log(wage_{it}) = \theta_0 + \theta_1 d91_t + \theta_2 d92_t + \delta_1 computer_{it} + \delta_2 educ_{it} + \delta_3 exper_{it} + \delta_4 female_i + u_{it}, \tag{7.71}$$

where $d91_t$ and $d92_t$ are dummy indicators for the years 1991 and 1992 and
$computer_{it}$ is a measure of how much person $i$ used a computer during year $t$. The
inclusion of the year dummies allows for aggregate time effects of the kind discussed in the
Section 7.2 examples. This equation contains a variable that is constant across $t$,
$female_i$, as well as variables that can change across $i$ and $t$, such as $educ_{it}$ and $exper_{it}$.
The variable $educ_{it}$ is given a $t$ subscript, which indicates that years of education
could change from year to year for at least some people. It could also be the case that
$educ_{it}$ is the same for all three years for every person in the sample, in which case we
could remove the time subscript. The distinction between variables that are
time-constant and those that are time-varying is not very important here; it becomes much
more important in Chapter 10.


As a general rule, with large N and small T it is a good idea to allow for separate 
intercepts for each time period. Doing so allows for aggregate time effects that have 
the same influence on $y_{it}$ for all $i$.

Anything that can be done in a cross section context can also be done in a panel 
data setting. For example, in equation (7.71) we can interact $female_i$ with the time
dummy variables to see whether the gender wage gap has changed over time, or we
can interact $educ_{it}$ and $computer_{it}$ to allow the return to computer usage to depend on
level of education.
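The interaction idea above can be sketched with a small stacked design matrix. This is only an illustration under assumed values: the four-person panel, the `female_i` values, and all variable names are hypothetical and are not taken from any data set used in the text.

```python
import numpy as np

# Hypothetical toy panel: 4 people observed in 1990, 1991, and 1992,
# stacked by person and then by year (12 rows in total).
years = np.array([1990, 1991, 1992])
n_people = 4
year = np.tile(years, n_people)

# Made-up time-constant indicator, repeated for each year of a person.
female_i = np.array([1.0, 0.0, 1.0, 0.0])
female = np.repeat(female_i, len(years))

# Year dummies (1990 is the base year) and their interactions with female.
d91 = (year == 1991).astype(float)
d92 = (year == 1992).astype(float)
female_d91 = female * d91
female_d92 = female * d92

# Columns: intercept, d91, d92, female, female*d91, female*d92.
X = np.column_stack([np.ones(year.size), d91, d92, female, female_d91, female_d92])
```

The coefficients on the two interaction columns would measure how the gender gap in 1991 and 1992 differs from the gap in the base year.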

The two assumptions sufficient for pooled OLS to consistently estimate $\boldsymbol{\beta}$ are as
follows:


ASSUMPTION POLS.1:  $E(\mathbf{x}_t' u_t) = \mathbf{0}$, $t = 1, 2, \ldots, T$.
ASSUMPTION POLS.2:  $\operatorname{rank}\left[\sum_{t=1}^{T} E(\mathbf{x}_t'\mathbf{x}_t)\right] = K$.


Remember, Assumption POLS.1 says nothing about the relationship between $\mathbf{x}_s$ and
$u_t$ for $s \neq t$. Assumption POLS.2 essentially rules out perfect linear dependencies
among the explanatory variables. 

To apply the usual OLS statistics from the pooled OLS regression across i and t, 
we need to add homoskedasticity and no serial correlation assumptions. The weakest 
forms of these assumptions are the following: 


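ASSUMPTION POLS.3: (a) $E(u_t^2 \mathbf{x}_t'\mathbf{x}_t) = \sigma^2 E(\mathbf{x}_t'\mathbf{x}_t)$, $t = 1, 2, \ldots, T$, where $\sigma^2 = E(u_t^2)$
for all $t$; (b) $E(u_t u_s \mathbf{x}_t'\mathbf{x}_s) = \mathbf{0}$, $t \neq s$, $t, s = 1, \ldots, T$.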


The first part of Assumption POLS.3 is a fairly strong homoskedasticity assumption;
sufficient is $E(u_t^2 \mid \mathbf{x}_t) = \sigma^2$ for all $t$. This means not only that the conditional variance
does not depend on $\mathbf{x}_t$, but also that the unconditional variance is the same in every
time period. Assumption POLS.3b essentially restricts the conditional covariances of
the errors across different time periods to be zero. In fact, since $\mathbf{x}_t$ almost always
contains a constant, POLS.3b requires at a minimum that $E(u_t u_s) = 0$, $t \neq s$.
Sufficient for POLS.3b is $E(u_t u_s \mid \mathbf{x}_t, \mathbf{x}_s) = 0$, $t \neq s$, $t, s = 1, \ldots, T$.

It is important to remember that Assumption POLS.3 implies more than just a
certain form of the unconditional variance matrix of $\mathbf{u} \equiv (u_1, \ldots, u_T)'$. Assumption
POLS.3 implies $E(\mathbf{u}_i\mathbf{u}_i') = \sigma^2 \mathbf{I}_T$, which means that the unconditional variances are
constant and the unconditional covariances are zero, but it also effectively restricts
the conditional variances and covariances.


THEOREM 7.7 (Large-Sample Properties of Pooled OLS): Under Assumptions
POLS.1 and POLS.2, the pooled OLS estimator is consistent and asymptotically
normal. If Assumption POLS.3 holds in addition, then
$\mathrm{Avar}(\hat{\boldsymbol{\beta}}) = \sigma^2 [E(\mathbf{X}_i'\mathbf{X}_i)]^{-1}/N$,
so that the appropriate estimator of $\mathrm{Avar}(\hat{\boldsymbol{\beta}})$ is

$$\hat{\sigma}^2 (\mathbf{X}'\mathbf{X})^{-1} = \hat{\sigma}^2 \left( \sum_{i=1}^{N} \sum_{t=1}^{T} \mathbf{x}_{it}'\mathbf{x}_{it} \right)^{-1}, \tag{7.72}$$

where $\hat{\sigma}^2$ is the usual OLS variance estimator from the pooled regression

$$y_{it} \text{ on } \mathbf{x}_{it}, \qquad t = 1, 2, \ldots, T; \; i = 1, \ldots, N. \tag{7.73}$$

It follows that the usual $t$ statistics and $F$ statistics from regression (7.73) are
approximately valid. Therefore, the $F$ statistic for testing $Q$ linear restrictions on the
$K \times 1$ vector $\boldsymbol{\beta}$ is

$$F = \frac{(SSR_r - SSR_{ur})}{SSR_{ur}} \cdot \frac{(NT - K)}{Q}, \tag{7.74}$$

where $SSR_{ur}$ is the sum of squared residuals from regression (7.73), and $SSR_r$ is the
sum of squared residuals from the regression using the $NT$ observations with the
restrictions imposed.


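Why is a simple pooled OLS analysis valid under Assumption POLS.3? It is
easy to show that Assumption POLS.3 implies that $\mathbf{B} = \sigma^2 \mathbf{A}$, where
$\mathbf{B} \equiv \sum_{t=1}^{T} \sum_{s=1}^{T} E(u_t u_s \mathbf{x}_t'\mathbf{x}_s)$ and $\mathbf{A} \equiv \sum_{t=1}^{T} E(\mathbf{x}_t'\mathbf{x}_t)$. For the panel data case, these are the
matrices that appear in expression (7.23).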
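The step from Assumption POLS.3 to $\mathbf{B} = \sigma^2 \mathbf{A}$ can be spelled out by splitting the double sum into its $t = s$ and $t \neq s$ terms. The last line of this sketch assumes that expression (7.23) is the usual sandwich form $\mathbf{A}^{-1}\mathbf{B}\mathbf{A}^{-1}/N$, which is not reproduced in this section.

```latex
\begin{aligned}
\mathbf{B} &= \sum_{t=1}^{T} \sum_{s=1}^{T} E(u_t u_s \mathbf{x}_t'\mathbf{x}_s) \\
  &= \sum_{t=1}^{T} E(u_t^2 \mathbf{x}_t'\mathbf{x}_t)
     + \sum_{t=1}^{T} \sum_{s \neq t} E(u_t u_s \mathbf{x}_t'\mathbf{x}_s) \\
  &= \sigma^2 \sum_{t=1}^{T} E(\mathbf{x}_t'\mathbf{x}_t) + \mathbf{0}
   = \sigma^2 \mathbf{A}, \\
\mathbf{A}^{-1}\mathbf{B}\mathbf{A}^{-1}/N &= \sigma^2 \mathbf{A}^{-1}/N .
\end{aligned}
```

The first sum uses POLS.3a and the second uses POLS.3b, so both parts of the assumption are needed for the sandwich to collapse to the usual OLS form.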

For computing the pooled OLS estimates and standard statistics, it does not matter 
how the data are ordered. However, if we put lags of any variables in the equation, it 
is easiest to order the data in the same way as is natural for studying asymptotic 
properties: the first T observations should be for the first cross section unit (ordered 
chronologically), the next T observations are for the next cross section unit, and so 
on. This procedure gives NT rows in the data set ordered in a very specific way. 
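As a rough illustration of equations (7.72) and (7.74), the following sketch stacks simulated data unit by unit, runs pooled OLS, and forms the usual variance estimate and an F statistic for excluding two time dummies. The data-generating process, the seed, and the function names `pooled_ols` and `f_statistic` are assumptions made for this example only.

```python
import numpy as np

def pooled_ols(y, X):
    """Pooled OLS on stacked data: y has shape (N*T,), X has shape (N*T, K)."""
    XtX_inv = np.linalg.inv(X.T @ X)
    beta = XtX_inv @ X.T @ y
    resid = y - X @ beta
    ssr = float(resid @ resid)
    sigma2 = ssr / (X.shape[0] - X.shape[1])   # usual OLS variance estimator
    avar = sigma2 * XtX_inv                    # estimator in equation (7.72)
    return beta, avar, ssr

def f_statistic(ssr_r, ssr_ur, n_obs, k, q):
    """F statistic of equation (7.74), with n_obs = N*T."""
    return (ssr_r - ssr_ur) / ssr_ur * (n_obs - k) / q

# Simulated panel (assumed DGP): rows ordered by unit, then by time period.
rng = np.random.default_rng(0)
N, T = 500, 3
period = np.tile(np.arange(T), N)
x = rng.normal(size=N * T)
d2 = (period == 1).astype(float)
d3 = (period == 2).astype(float)
X = np.column_stack([np.ones(N * T), d2, d3, x])
y = 1.0 + 0.5 * x + rng.normal(size=N * T)     # time dummies have zero true effect

beta_hat, avar_hat, ssr_ur = pooled_ols(y, X)
se = np.sqrt(np.diag(avar_hat))

# Restricted model drops the two time dummies (Q = 2 restrictions).
_, _, ssr_r = pooled_ols(y, X[:, [0, 3]])
F = f_statistic(ssr_r, ssr_ur, N * T, X.shape[1], 2)
```

These statistics are only appropriate when Assumption POLS.3 holds; the simulated errors here are i.i.d., so the assumption is satisfied by construction.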

Example 7.7 (Effects of Job Training Grants on Firm Scrap Rates): Using the data
from JTRAIN1.RAW (Holzer, Block, Cheatham, and Knott, 1993), we estimate a
model explaining the firm scrap rate in terms of grant receipt. We can estimate the 
equation for 54 firms and three years of data (1987, 1988, and 1989). The first grants 
were given in 1988. Some firms in the sample in 1989 received a grant only in 1988, so 
we allow for a one-year-lagged effect: 


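$$\widehat{\log(scrap_{it})} = \underset{(.203)}{.597} - \underset{(.311)}{.239}\, d88_t - \underset{(.338)}{.497}\, d89_t + \underset{(.338)}{.200}\, grant_{it} + \underset{(.436)}{.049}\, grant_{i,t-1},$$

$$N = 54, \quad T = 3, \quad R^2 = .0173,$$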


where we have put $i$ and $t$ subscripts on the variables to emphasize which ones change
across firm or time. The R-squared is just the usual one computed from the pooled 
OLS regression. 

In this equation, the estimated grant effect has the wrong sign, and neither the 
current nor the lagged grant variable is statistically significant. When a lag of 
$\log(scrap_{it})$ is added to the equation, the estimates are notably different. See Problem
7.9. 


7.8.2 Dynamic Completeness 


While the homoskedasticity assumption, Assumption POLS.3a, can never be guar- 
anteed to hold, there is one important case where Assumption POLS.3b must hold. 
Suppose that the explanatory variables $\mathbf{x}_t$ are such that, for all $t$,


$$E(y_t \mid \mathbf{x}_t, y_{t-1}, \mathbf{x}_{t-1}, \ldots, y_1, \mathbf{x}_1) = E(y_t \mid \mathbf{x}_t). \tag{7.75}$$


This assumption means that $\mathbf{x}_t$ contains sufficient lags of all variables such that
additional lagged values have no partial effect on $y_t$. The inclusion of lagged $y$ in
equation (7.75) is important. For example, if $\mathbf{z}_t$ is a vector of contemporaneous
variables such that


$$E(y_t \mid \mathbf{z}_t, \mathbf{z}_{t-1}, \ldots, \mathbf{z}_1) = E(y_t \mid \mathbf{z}_t, \mathbf{z}_{t-1}, \ldots, \mathbf{z}_{t-L})$$


and we choose $\mathbf{x}_t = (\mathbf{z}_t, \mathbf{z}_{t-1}, \ldots, \mathbf{z}_{t-L})$, then $E(y_t \mid \mathbf{x}_t, \mathbf{x}_{t-1}, \ldots, \mathbf{x}_1) = E(y_t \mid \mathbf{x}_t)$, that
is, the sequential exogeneity assumption holds. But equation (7.75) need not hold.
Generally, in static and FDL models, there is no reason to expect equation (7.75) to
hold, even in the absence of specification problems such as omitted variables.

We call equation (7.75) dynamic completeness of the conditional mean, which clearly
implies sequential exogeneity. Often, we can ensure that equation (7.75) is at least
approximately true by putting sufficient lags of $\mathbf{z}_t$ and $y_t$ into $\mathbf{x}_t$.

In terms of the disturbances, equation (7.75) is equivalent to

$$E(u_t \mid \mathbf{x}_t, u_{t-1}, \mathbf{x}_{t-1}, \ldots, u_1, \mathbf{x}_1) = 0, \tag{7.76}$$


and, by iterated expectations, equation (7.76) implies $E(u_t u_s \mid \mathbf{x}_t, \mathbf{x}_s) = 0$, $s \neq t$.
Therefore, equation (7.75) implies Assumption POLS.3b as well as Assumption
POLS.1. If equation (7.75) holds along with the homoskedasticity assumption
$\mathrm{Var}(y_t \mid \mathbf{x}_t) = \sigma^2$, then Assumptions POLS.1 and POLS.3 both hold, and standard
OLS statistics can be used for inference.
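The iterated expectations argument can be written out as follows for $s < t$; the case $s > t$ follows by exchanging the roles of $s$ and $t$. This is a sketch of the reasoning rather than a full proof, and it assumes the relevant moments exist.

```latex
\begin{aligned}
E(u_t u_s \mid \mathbf{x}_t, \mathbf{x}_s)
  &= E\left[ E(u_t u_s \mid \mathbf{x}_t, u_s, \mathbf{x}_s) \mid \mathbf{x}_t, \mathbf{x}_s \right] \\
  &= E\left[ u_s \, E(u_t \mid \mathbf{x}_t, u_s, \mathbf{x}_s) \mid \mathbf{x}_t, \mathbf{x}_s \right] \\
  &= E\left[ u_s \cdot 0 \mid \mathbf{x}_t, \mathbf{x}_s \right] = 0, \qquad s < t .
\end{aligned}
```

The inner expectation is zero because $(\mathbf{x}_t, u_s, \mathbf{x}_s)$ is a subset of the conditioning set in equation (7.76) whenever $s < t$, so $E(u_t \mid \mathbf{x}_t, u_s, \mathbf{x}_s) = 0$ by a further application of iterated expectations.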

The following example is similar in spirit to an analysis of Maloney and McCormick 
(1993), who use a large random sample of students (including nonathletes) from 
Clemson University in a cross section analysis. 


Example 7.8 (Effect of Being in Season on Grade Point Average): The data in 
GPA.RAW are on 366 student-athletes at a large university. There are two semesters 
of data (fall and spring) for each student. Of primary interest is the “in-season” effect 
on athletes’ GPAs. The model—with i, t subscripts—is 


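$$\begin{aligned} trmgpa_{it} = {} & \beta_0 + \beta_1 spring_t + \beta_2 cumgpa_{it} + \beta_3 crsgpa_{it} + \beta_4 frstsem_{it} + \beta_5 season_{it} + \beta_6 SAT_i \\ & + \beta_7 verbmath_i + \beta_8 hsperc_i + \beta_9 hssize_i + \beta_{10} black_i + \beta_{11} female_i + u_{it}. \end{aligned}$$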


The variable $cumgpa_{it}$ is cumulative GPA at the beginning of the term, and this
clearly depends on past-term GPAs. In other words, this model has something akin
to a lagged dependent variable. In addition, it contains other variables that change
over time (such as $season_{it}$) and several variables that do not (such as $SAT_i$). We
assume that the right-hand side (without $u_{it}$) represents a conditional expectation, so
that $u_{it}$ is necessarily uncorrelated with all explanatory variables and any functions of
them. It may or may not be that the model is also dynamically complete in the sense
of equation (7.75); we will show one way to test this assumption in Section 7.8.5. The
estimated equation is


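$$\begin{aligned} \widehat{trmgpa}_{it} = {} & \underset{(0.34)}{-2.07} - \underset{(.046)}{.012}\, spring_t + \underset{(.040)}{.315}\, cumgpa_{it} + \underset{(.096)}{.984}\, crsgpa_{it} \\ & + \underset{(.120)}{.769}\, frstsem_{it} - \underset{(.047)}{.046}\, season_{it} + \underset{(.00015)}{.00141}\, SAT_i - \underset{(.131)}{.113}\, verbmath_i \\ & - \underset{(.0010)}{.0066}\, hsperc_i - \underset{(.000099)}{.000058}\, hssize_i - \underset{(.054)}{.231}\, black_i + \underset{(.051)}{.286}\, female_i, \end{aligned}$$

$$N = 366, \quad T = 2, \quad R^2 = .519.$$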

The in-season effect is small (an athlete's GPA is estimated to be .046 points lower
when the sport is in season), and it is statistically insignificant as well. The other
coefficients have reasonable signs and magnitudes.


Often, once we start putting any lagged values of $y_t$ into $\mathbf{x}_t$, then equation (7.75) is
an intended assumption. But this generalization is not always true. In the previous
example, we can think of the variable cumgpa as another control we are using to hold 
other factors fixed when looking at an in-season effect on GPA for college athletes: 
cumgpa can proxy for omitted factors that make someone successful in college. We 
may not care that serial correlation is still present in the error, except that, if equation 
(7.75) fails, we need to estimate the asymptotic variance of the pooled OLS estimator 
to be robust to serial correlation (and perhaps heteroskedasticity as well). 

In introductory econometrics, students are often warned that having serial corre- 
lation in a model with a lagged dependent variable causes the OLS estimators to be 
inconsistent. While this statement is true in the context of a specific model of serial 
correlation, it is not true in general, and therefore it is very misleading. (See
Wooldridge (2009a, Chap. 12) for more discussion in the context of the AR(1) model.) Our
analysis shows that, whatever is included in $\mathbf{x}_t$, pooled OLS provides consistent
estimators of $\boldsymbol{\beta}$ whenever $E(y_t \mid \mathbf{x}_t) = \mathbf{x}_t \boldsymbol{\beta}$; it does not matter that the $u_t$ might be
serially correlated.


7.8.3 Note on Time Series Persistence 


Theorem 7.7 imposes no restrictions on the time series persistence in the data
$\{(\mathbf{x}_{it}, y_{it}): t = 1, 2, \ldots, T\}$. In light of the explosion of work in time series
econometrics on asymptotic theory with persistent processes [often called unit root
processes; see, for example, Hamilton (1994)], it may appear that we have not been careful in
stating our assumptions. However, we do not need to restrict the dynamic behavior
of our data in any way because we are doing fixed-$T$, large-$N$ asymptotics. It is for
this reason that the mechanics of the asymptotic analysis is the same for the SUR
case and the panel data case. If $T$ is large relative to $N$, the asymptotics here may be
misleading. Fixing $N$ while $T$ grows or letting $N$ and $T$ both grow takes us into the
realm of multiple time series analysis: we would have to know about the temporal 
dependence in the data, and, to have a general treatment, we would have to assume 
some form of weak dependence (see Wooldridge, 1994a, for a discussion of weak 
dependence). Recently, progress has been made on asymptotics in panel data with 
large T and N when the data have unit roots; see, for example, Pesaran and Smith 
(1995), Moon and Phillips (2000), and Phillips and Moon (2000). 

As an example, consider the simple AR(1) model

$$y_{it} = \beta_0 + \beta_1 y_{i,t-1} + u_{it}, \qquad E(u_{it} \mid y_{i,t-1}, \ldots, y_{i0}) = 0.$$


Assumption POLS.1 holds (provided the appropriate moments exist). Also,
Assumption POLS.2 can be maintained. Since this model is dynamically complete, the
only potential nuisance is heteroskedasticity in $u_{it}$ that changes over time or depends
on $y_{i,t-1}$. In any case, the pooled OLS estimator from the regression $y_{it}$ on 1, $y_{i,t-1}$,
$t = 1, \ldots, T$, $i = 1, \ldots, N$, produces consistent, $\sqrt{N}$-asymptotically normal
estimators for fixed $T$ as $N \to \infty$, for any values of $\beta_0$ and $\beta_1$.

In a pure time series case, or in a panel data case with $T \to \infty$ and $N$ fixed, we
would have to assume $|\beta_1| < 1$, which is the stability condition for an AR(1) model.
Cases where $|\beta_1| \geq 1$ cause considerable complications when the asymptotics is done
along the time series dimension (see Hamilton, 1994, Chapter 19). Here, a large cross
section and relatively short time series allow us to be agnostic about the amount of
temporal persistence.
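A small simulation can make the fixed-$T$, large-$N$ point concrete. The sketch below generates a panel AR(1) with $\beta_1 = 1$ (a unit root), a short time dimension, and a large cross section, and then runs pooled OLS of $y_{it}$ on 1 and $y_{i,t-1}$. The sample sizes, the seed, and the standard normal errors and initial condition are assumptions chosen only for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
N, T = 5000, 4
beta0, beta1 = 0.0, 1.0          # beta1 = 1 is the unit root case

# Simulate y_i0, ..., y_iT for each unit i (assumed normal errors).
y = np.empty((N, T + 1))
y[:, 0] = rng.normal(size=N)     # initial condition y_i0
for t in range(1, T + 1):
    y[:, t] = beta0 + beta1 * y[:, t - 1] + rng.normal(size=N)

# Stack by unit, then time: dependent variable y_it and regressor y_{i,t-1}.
y_dep = y[:, 1:].ravel()
y_lag = y[:, :-1].ravel()
X = np.column_stack([np.ones(N * T), y_lag])

# Pooled OLS of y_it on (1, y_{i,t-1}).
b0_hat, b1_hat = np.linalg.lstsq(X, y_dep, rcond=None)[0]
```

With many cross section units, the estimate of $\beta_1$ should land close to one even though each individual series is highly persistent; nothing here speaks to the large-$T$ case.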


7.8.4 Robust Asymptotic Variance Matrix 


Because Assumption POLS.3 can be restrictive, it is often useful to obtain a robust
estimate of $\mathrm{Avar}(\hat{\boldsymbol{\beta}})$ that is valid without Assumption POLS.3. We have already
seen the general form of the estimator, given in matrix (7.28). In the case of panel
data, this estimator is fully robust to arbitrary heteroskedasticity (conditional or
unconditional) and arbitrary serial correlation across time (again, conditional or
unconditional). The residuals $\hat{\mathbf{u}}_i$ are the $T \times 1$ pooled OLS residuals for cross
section observation $i$. The fully robust variance matrix estimator can be expressed in the
sandwich form as


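$$\widehat{\mathrm{Avar}}(\hat{\boldsymbol{\beta}}) = \left( \sum_{i=1}^{N} \mathbf{X}_i'\mathbf{X}_i \right)^{-1} \left( \sum_{i=1}^{N} \sum_{t=1}^{T} \sum_{s=1}^{T} \hat{u}_{it}\hat{u}_{is} \mathbf{x}_{it}'\mathbf{x}_{is} \right) \left( \sum_{i=1}^{N} \mathbf{X}_i'\mathbf{X}_i \right)^{-1}, \tag{7.77}$$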


where the middle matrix can also be written as 


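$$\sum_{i=1}^{N} \sum_{t=1}^{T} \hat{u}_{it}^2 \mathbf{x}_{it}'\mathbf{x}_{it} + \sum_{i=1}^{N} \sum_{t=1}^{T} \sum_{s \neq t} \hat{u}_{it}\hat{u}_{is} \mathbf{x}_{it}'\mathbf{x}_{is}. \tag{7.78}$$
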
This last expression makes it clear that the estimator is robust to arbitrary hetero- 
skedasticity and arbitrary serial correlation. Some statistical packages compute 
equation (7.77) very easily, although the command may be disguised. (For example, 
in Stata, equation (7.77) is computed using a cluster sampling option. We cover
cluster sampling in Chapter 21.) Whether a software package has this capability or
whether it must be programmed by you, the data must be stored as described earlier:
The $(\mathbf{y}_i, \mathbf{X}_i)$ should be stacked on top of one another for $i = 1, \ldots, N$.
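For readers who want to program the estimator directly, here is a minimal NumPy sketch of the sandwich in equation (7.77). It assumes a balanced panel with rows ordered by unit and then by time; the function name `robust_avar` and the simulated data (errors containing a time-constant component, which induces serial correlation) are illustrative assumptions, not part of the text.

```python
import numpy as np

def robust_avar(X, resid, N, T):
    """Fully robust variance matrix (7.77); rows ordered by unit, then time."""
    K = X.shape[1]
    Xi = X.reshape(N, T, K)                     # X_i blocks, each T x K
    ui = resid.reshape(N, T)                    # residual vectors u_i, length T
    A_inv = np.linalg.inv(X.T @ X)              # (sum_i X_i'X_i)^{-1}
    scores = np.einsum('itk,it->ik', Xi, ui)    # row i holds X_i'u_i
    B = scores.T @ scores                       # sum_i X_i'u_i u_i'X_i
    return A_inv @ B @ A_inv

# Simulated balanced panel (assumed DGP) with serially correlated errors.
rng = np.random.default_rng(2)
N, T = 200, 3
x = rng.normal(size=(N, T))
c = rng.normal(size=(N, 1))                     # time-constant factor in the error
u = c + rng.normal(size=(N, T))
y = (1.0 + 0.5 * x + u).ravel()
X = np.column_stack([np.ones(N * T), x.ravel()])

beta = np.linalg.lstsq(X, y, rcond=None)[0]
resid = y - X @ beta
V_robust = robust_avar(X, resid, N, T)
se_robust = np.sqrt(np.diag(V_robust))
```

The middle matrix is built from the per-unit sums $\mathbf{X}_i'\hat{\mathbf{u}}_i$, which is algebraically the same as the triple sum over $i$, $t$, and $s$, so the heteroskedasticity and serial correlation terms in (7.78) are both included.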


7.8.5 Testing for Serial Correlation and Heteroskedasticity after Pooled Ordinary 
Least Squares 


Testing for Serial Correlation It is often useful to have a simple way to detect serial 
correlation after estimation by pooled OLS. One reason to test for serial correlation 
is that it should not be present if the model is supposed to be dynamically complete in 
the conditional mean. A second reason to test for serial correlation is to see whether 
we should compute a robust variance matrix estimator for the pooled OLS estimator. 

One interpretation of serial correlation in the errors of a panel data model is that 
the error in each time period contains a time-constant omitted factor, a case we cover 
explicitly in Chapter 10. For now, we are simply interested in knowing whether or 
not the errors are serially correlated. 

We focus on the alternative that the error is a first-order autoregressive process; 
this will have power against fairly general kinds of serial correlation. Write the AR(1) 
model as 


$$u_t = \rho_1 u_{t-1} + e_t, \tag{7.79}$$

where

$$E(e_t \mid \mathbf{x}_t, u_{t-1}, \mathbf{x}_{t-1}, u_{t-2}, \ldots) = 0. \tag{7.80}$$

Under the null hypothesis of no serial correlation, $\rho_1 = 0$.

One way to proceed is to write the dynamic model under AR(1) serial correlation as 


$$y_t = \mathbf{x}_t \boldsymbol{\beta} + \rho_1 u_{t-1} + e_t, \qquad t = 2, \ldots, T, \tag{7.81}$$


where we lose the first time period due to the presence of $u_{t-1}$. If we can observe the
$u_t$, it is clear how we should proceed: simply estimate equation (7.81) by pooled OLS
(losing the first time period) and perform a $t$ test on $\hat{\rho}_1$. To operationalize this
procedure, we replace the $u_t$ with the pooled OLS residuals. Therefore, we run the regression


$$y_{it} \text{ on } \mathbf{x}_{it}, \hat{u}_{i,t-1}, \qquad t = 2, \ldots, T; \; i = 1, \ldots, N, \tag{7.82}$$


and do a standard $t$ test on the coefficient of $\hat{u}_{i,t-1}$. A statistic that is robust to
arbitrary heteroskedasticity in $\mathrm{Var}(y_t \mid \mathbf{x}_t, u_{t-1})$ is obtained by the usual
heteroskedasticity-robust $t$ statistic in the pooled regression. This includes Engle's (1982) ARCH model
and any other form of static or dynamic heteroskedasticity.

Why is a $t$ test from regression (7.82) valid? Under dynamic completeness, equation
(7.81) satisfies Assumptions POLS.1 through POLS.3 if we also assume that $\mathrm{Var}(y_t \mid \mathbf{x}_t, u_{t-1})$
is constant. Further, the presence of the generated regressor $\hat{u}_{i,t-1}$ does not affect the
limiting distribution of $\hat{\rho}_1$ under the null because $\rho_1 = 0$. Verifying this claim is
similar to the pure cross section case in Section 6.1.1.

A nice feature of the statistic computed from regression (7.82) is that it works
whether or not $\mathbf{x}_t$ is strictly exogenous. A different form of the test is valid if we
assume strict exogeneity: use the $t$ statistic on $\hat{u}_{i,t-1}$ in the regression


$$\hat{u}_{it} \text{ on } \hat{u}_{i,t-1}, \qquad t = 2, \ldots, T; \; i = 1, \ldots, N, \tag{7.83}$$


or its heteroskedasticity-robust form. That this test is valid follows by applying 
Problem 7.4 and the assumptions for pooled OLS with a lagged dependent variable. 
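The test based on regression (7.82) can be sketched in a few lines. The function below (the name `ar1_test` is hypothetical) computes pooled OLS residuals, drops the first period for each unit, adds the lagged residual as a regressor, and returns the nonrobust $t$ statistic on it. The simulated data assume a time-constant component in the error, so serial correlation is present by construction. If the regressors include time dummies, one of them may need to be dropped after losing the first period, as in Example 7.9; this sketch does not handle that case.

```python
import numpy as np

def ar1_test(y, X, N, T):
    """Coefficient and t statistic on the lagged residual in regression (7.82).

    Assumes a balanced panel with rows ordered by unit, then time.
    """
    beta = np.linalg.lstsq(X, y, rcond=None)[0]
    u = (y - X @ beta).reshape(N, T)            # pooled OLS residuals
    u_lag = u[:, :-1].ravel()                   # u_hat_{i,t-1} for t = 2, ..., T
    keep = np.tile(np.arange(T) >= 1, N)        # drop the first period of each unit
    Z = np.column_stack([X[keep], u_lag])
    yk = y[keep]
    coef = np.linalg.lstsq(Z, yk, rcond=None)[0]
    e = yk - Z @ coef
    sigma2 = e @ e / (Z.shape[0] - Z.shape[1])
    V = sigma2 * np.linalg.inv(Z.T @ Z)         # usual (nonrobust) OLS variance
    rho_hat = coef[-1]
    return rho_hat, rho_hat / np.sqrt(V[-1, -1])

# Simulated panel (assumed DGP): the error contains a time-constant factor.
rng = np.random.default_rng(3)
N, T = 500, 3
x = rng.normal(size=(N, T))
c = rng.normal(size=(N, 1))
u = c + rng.normal(size=(N, T))
y = (1.0 + 0.5 * x + u).ravel()
X = np.column_stack([np.ones(N * T), x.ravel()])

rho_hat, t_stat = ar1_test(y, X, N, T)
```

A heteroskedasticity-robust version would replace the variance matrix `V` with a robust one while keeping the same regression.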


Example 7.9 (Athletes’ Grade Point Averages, continued): We apply the test from 
regression (7.82) because cumgpa cannot be strictly exogenous (GPA this term affects 
cumulative GPA after this term). We drop the variables spring and frstsem from re- 
gression (7.82), since these are identically unity and zero, respectively, in the spring 
semester. We obtain $\hat{\rho}_1 = .194$ and $t_{\hat{\rho}_1} = 3.18$, and so the null hypothesis is rejected.
Thus there is still some work to do to capture the full dynamics. But, if we assume 
that we are interested in the conditional expectation implicit in the estimation, we are 
getting consistent estimators. This result is useful to know because we are primarily 
interested in the in-season effect, and the other variables are simply acting as controls. 
The presence of serial correlation means that we should compute standard errors 
robust to arbitrary serial correlation (and heteroskedasticity); see Problem 7.10. 


Testing for Heteroskedasticity The primary reason to test for heteroskedasticity 
after running pooled OLS is to detect violation of Assumption POLS.3a, which is one 
of the assumptions needed for the usual statistics accompanying a pooled OLS 
regression to be valid. We assume throughout this section that $E(u_t \mid \mathbf{x}_t) = 0$,
$t = 1, 2, \ldots, T$, which strengthens Assumption POLS.1 but does not require strict
exogeneity. Then the null hypothesis of homoskedasticity can be stated as
$E(u_t^2 \mid \mathbf{x}_t) = \sigma^2$, $t = 1, 2, \ldots, T$.

Under $H_0$, $u_{it}^2$ is uncorrelated with any function of $\mathbf{x}_{it}$; let $\mathbf{h}_{it}$ denote a $1 \times Q$ vector
of nonconstant functions of $\mathbf{x}_{it}$. In particular, $\mathbf{h}_{it}$ can, and often should, contain
dummy variables for the different time periods.

From the tests for heteroskedasticity in Section 6.3.4, the following procedure is
natural. Let $\hat{u}_{it}^2$ denote the squared pooled OLS residuals. Then obtain the usual
R-squared, $R_c^2$, from the regression

$$\hat{u}_{it}^2 \text{ on } 1, \mathbf{h}_{it}, \qquad t = 1, \ldots, T; \; i = 1, \ldots, N. \tag{7.84}$$


The test statistic is $NT R_c^2$, which is treated as asymptotically $\chi_Q^2$ under $H_0$.
(Alternatively, we can use the usual $F$ test of joint significance of $\mathbf{h}_{it}$ from the pooled
OLS regression. The degrees of freedom are $Q$ and $NT - K$.) When is this procedure
valid?

Using arguments very similar to the cross sectional tests from Chapter 6, it can be
shown that the statistic has the same distribution if $u_{it}^2$ replaces $\hat{u}_{it}^2$; this fact is very
convenient because it allows us to focus on the other features of the test. Effectively,
we are performing a standard LM test of $H_0: \boldsymbol{\delta} = \mathbf{0}$ in the model

$$u_{it}^2 = \delta_0 + \mathbf{h}_{it}\boldsymbol{\delta} + a_{it}, \qquad t = 1, 2, \ldots, T. \tag{7.85}$$


This test requires that the errors $\{a_{it}\}$ be appropriately serially uncorrelated and
requires homoskedasticity; that is, Assumption POLS.3 must hold in equation (7.85).
Therefore, the tests based on nonrobust statistics from regression (7.84) essentially
require that $E(a_{it}^2 \mid \mathbf{x}_{it})$ be constant, meaning that $E(u_{it}^4 \mid \mathbf{x}_{it})$ must be constant under $H_0$.
We also need a stronger homoskedasticity assumption; $E(u_{it}^2 \mid \mathbf{x}_{it}, u_{i,t-1}, \mathbf{x}_{i,t-1}, \ldots) = \sigma^2$
is sufficient for the $\{a_{it}\}$ in equation (7.85) to be appropriately serially uncorrelated.

A fully robust test for heteroskedasticity can be computed from the pooled
regression (7.84) by obtaining a fully robust variance matrix estimator for $\hat{\boldsymbol{\delta}}$ [see equation
(7.28)]; this can be used to form a robust Wald statistic.

Since violation of Assumption POLS.3a is of primary interest, it makes sense to
include elements of $\mathbf{x}_{it}$ in $\mathbf{h}_{it}$, and possibly squares and cross products of elements of
$\mathbf{x}_{it}$. Another useful choice, covered in Chapter 6, is $\mathbf{h}_{it} = (\hat{y}_{it}, \hat{y}_{it}^2)$, the pooled OLS
fitted values and their squares. Also, Assumption POLS.3a requires the unconditional
variances $E(u_{it}^2)$ to be the same across $t$. Whether they are can be tested directly
by choosing $\mathbf{h}_{it}$ to have $T - 1$ time dummies.

If heteroskedasticity is detected but serial correlation is not, then the usual 
heteroskedasticity-robust standard errors and test statistics from the pooled OLS re- 
gression (7.73) can be used. 
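The nonrobust $NT R_c^2$ statistic from regression (7.84) is straightforward to compute. In the sketch below, the function name `het_lm_stat`, the simulated data (with conditional variance $\exp(x)$), and the choice $\mathbf{h}_{it} = (x_{it}, x_{it}^2)$ are all assumptions for illustration; the 5% critical value of a chi-square distribution with two degrees of freedom is hard-coded as approximately 5.99.

```python
import numpy as np

def het_lm_stat(resid, H):
    """N*T times the R-squared from regressing squared residuals on (1, H)."""
    u2 = resid ** 2
    Z = np.column_stack([np.ones(len(u2)), H])
    fitted = Z @ np.linalg.lstsq(Z, u2, rcond=None)[0]
    ssr = np.sum((u2 - fitted) ** 2)
    sst = np.sum((u2 - u2.mean()) ** 2)
    r2 = 1.0 - ssr / sst
    return len(u2) * r2

# Simulated stacked panel (assumed DGP) with Var(u | x) = exp(x).
rng = np.random.default_rng(5)
N, T = 500, 3
x = rng.normal(size=N * T)
u = np.exp(0.5 * x) * rng.normal(size=N * T)
y = 1.0 + 0.5 * x + u
X = np.column_stack([np.ones(N * T), x])

resid = y - X @ np.linalg.lstsq(X, y, rcond=None)[0]
H = np.column_stack([x, x ** 2])                 # Q = 2 nonconstant functions of x
lm = het_lm_stat(resid, H)

CHI2_2_CRIT_5PCT = 5.99                          # approximate 5% critical value, 2 df
reject = lm > CHI2_2_CRIT_5PCT
```

As the text notes, this nonrobust form relies on additional assumptions about the fourth moment and serial dependence of the errors; a robust Wald statistic avoids them.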


7.8.6 Feasible Generalized Least Squares Estimation under Strict Exogeneity 


When $E(\mathbf{u}_i\mathbf{u}_i') \neq \sigma^2 \mathbf{I}_T$, it is reasonable to consider a feasible GLS analysis rather than
a pooled OLS analysis. In Chapter 10 we will cover a particular FGLS analysis after 
we introduce unobserved components panel data models. With large N and small 
T, nothing precludes an FGLS analysis in the current setting. However, we must 
remember that FGLS is not even guaranteed to produce consistent, let alone efficient, 
estimators under Assumptions POLS.1 and POLS.2. Unless $\boldsymbol{\Omega} = E(\mathbf{u}_i\mathbf{u}_i')$ is a diagonal
matrix, Assumption POLS.1 should be replaced with the strict exogeneity
assumption (7.8). (Problem 7.7 covers the case when $\boldsymbol{\Omega}$ is diagonal.) Sometimes we are
willing to assume strict exogeneity in static and finite distributed lag models. As we
saw earlier, it cannot hold in models with lagged $y_{it}$, and it can fail in static models or
distributed lag models if there is feedback from $y_{it}$ to future $\mathbf{z}_{it}$.

If we are comfortable with the strict exogeneity assumption, a useful FGLS analysis
is obtained when $\{u_{it}: t = 1, 2, \ldots, T\}$ is assumed to follow an AR(1) process,
$u_{it} = \rho u_{i,t-1} + e_{it}$. In this case, it is easy to transform the variables and compute the
FGLS estimator from a pooled OLS regression. Let $\{\hat{u}_{it}: t = 1, \ldots, T; i = 1, \ldots, N\}$
denote the POLS residuals and obtain $\hat{\rho}$ as the coefficient from the regression $\hat{u}_{it}$ on
$\hat{u}_{i,t-1}$, $t = 2, \ldots, T$, $i = 1, \ldots, N$ (where we must be sure to omit the first time period
for each $i$). We can modify the Prais-Winsten approach (for example, Wooldridge,
2009a, Section 12.3) to be applicable on panel data. If the AR(1) process is stable
(that is, $|\rho| < 1$) and we assume stationary innovations $\{e_{it}\}$, then $\sigma_u^2 = \sigma_e^2/(1 - \rho^2)$.
Therefore, for all $t = 1$ observations, we define $\tilde{y}_{i1} = (1 - \hat{\rho}^2)^{1/2} y_{i1}$ and
$\tilde{\mathbf{x}}_{i1} = (1 - \hat{\rho}^2)^{1/2} \mathbf{x}_{i1}$. For $t = 2, \ldots, T$, $\tilde{y}_{it} = y_{it} - \hat{\rho} y_{i,t-1}$ and
$\tilde{\mathbf{x}}_{it} = \mathbf{x}_{it} - \hat{\rho} \mathbf{x}_{i,t-1}$. Then, the
FGLS estimator is obtained from the pooled OLS regression


$$\tilde{y}_{it} \text{ on } \tilde{\mathbf{x}}_{it}, \qquad t = 1, \ldots, T; \; i = 1, \ldots, N. \tag{7.79}$$


If $\boldsymbol{\Omega}$ truly has the AR(1) form and Assumption SGLS.3 holds, then the usual
standard errors and test statistics from (7.79) are asymptotically valid. If we have any
doubts about the homoskedasticity assumption, or whether the AR(1) assumption
sufficiently captures the serial dependence, we can just apply the usual fully robust
variance matrix (7.77) and associated statistics to pooled OLS on the transformed
variables. This approach will probably yield an estimator more efficient than POLS
(on the original data) while also guarding against the rather simple structure we imposed
on $\boldsymbol{\Omega}$. Of course, failure of strict exogeneity generally causes the Prais-Winsten
estimator of $\boldsymbol{\beta}$ to be inconsistent.

Notice that the Prais-Winsten approach allows us to use all $t = 1$ observations.
With panel data, it is simply too costly to drop the first time period, as in a
Cochrane-Orcutt approach. (Indeed, the Cochrane-Orcutt estimator is asymptotically less
efficient in the panel data case with fixed $T$, $N \to \infty$ asymptotics because it drops $N$
observations relative to Prais-Winsten.)
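The panel Prais-Winsten steps described in this subsection can be sketched as follows. The function name `prais_winsten_panel` and the simulated data (strictly exogenous regressor, stationary AR(1) errors with $\rho = 0.6$) are assumptions for illustration. The sketch requires a balanced panel ordered by unit and then time, and it presumes $|\hat{\rho}| < 1$ so that the first-period scaling is well defined.

```python
import numpy as np

def prais_winsten_panel(y, X, N, T):
    """FGLS under an assumed AR(1) error; rows ordered by unit, then time."""
    K = X.shape[1]
    b_pols = np.linalg.lstsq(X, y, rcond=None)[0]
    u = (y - X @ b_pols).reshape(N, T)           # POLS residuals

    # rho_hat from regressing u_it on u_{i,t-1}, t = 2, ..., T (no intercept).
    rho = np.sum(u[:, 1:] * u[:, :-1]) / np.sum(u[:, :-1] ** 2)

    Y = y.reshape(N, T)
    Xp = X.reshape(N, T, K)
    Yt = np.empty_like(Y)
    Xt = np.empty_like(Xp)

    # First period: scale by sqrt(1 - rho^2); requires |rho| < 1.
    scale = np.sqrt(1.0 - rho ** 2)
    Yt[:, 0] = scale * Y[:, 0]
    Xt[:, 0, :] = scale * Xp[:, 0, :]

    # Later periods: quasi-difference with rho_hat.
    Yt[:, 1:] = Y[:, 1:] - rho * Y[:, :-1]
    Xt[:, 1:, :] = Xp[:, 1:, :] - rho * Xp[:, :-1, :]

    b_fgls = np.linalg.lstsq(Xt.reshape(N * T, K), Yt.ravel(), rcond=None)[0]
    return b_fgls, rho

# Simulated panel (assumed DGP) with stationary AR(1) errors.
rng = np.random.default_rng(4)
N, T, rho_true = 1000, 4, 0.6
x = rng.normal(size=(N, T))
u = np.empty((N, T))
u[:, 0] = rng.normal(size=N) / np.sqrt(1.0 - rho_true ** 2)   # stationary start
for t in range(1, T):
    u[:, t] = rho_true * u[:, t - 1] + rng.normal(size=N)
y = (1.0 + 0.5 * x + u).ravel()
X = np.column_stack([np.ones(N * T), x.ravel()])

b_fgls, rho_hat = prais_winsten_panel(y, X, N, T)
```

Note that the intercept column is transformed along with the other regressors, so it is no longer a column of ones in the transformed regression. Fully robust standard errors could then be obtained by applying the sandwich estimator (7.77) to the transformed data.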

Some statistical packages, including Stata, allow FGLS estimation with a mis- 
specified variance matrix, but often this is under the guise of “generalized estimating 
equations,” a topic we will treat in Chapter 12. 


Problems


7.1. Provide the details for a proof of Theorem 7.1. 


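7.2. In model (7.11), maintain Assumptions SOLS.1 and SOLS.2, and assume
$E(\mathbf{X}_i'\mathbf{u}_i\mathbf{u}_i'\mathbf{X}_i) = E(\mathbf{X}_i'\boldsymbol{\Omega}\mathbf{X}_i)$, where $\boldsymbol{\Omega} \equiv E(\mathbf{u}_i\mathbf{u}_i')$. (The last assumption is a different way
of stating the homoskedasticity assumption for systems of equations; it always holds
if assumption (7.53) holds.) Let $\hat{\boldsymbol{\beta}}_{SOLS}$ denote the system OLS estimator.

a. Show that $\mathrm{Avar}(\hat{\boldsymbol{\beta}}_{SOLS}) = [E(\mathbf{X}_i'\mathbf{X}_i)]^{-1}[E(\mathbf{X}_i'\boldsymbol{\Omega}\mathbf{X}_i)][E(\mathbf{X}_i'\mathbf{X}_i)]^{-1}/N$.

b. How would you estimate the asymptotic variance in part a?

c. Now add Assumptions SGLS.1 through SGLS.3. Show that $\mathrm{Avar}(\hat{\boldsymbol{\beta}}_{SOLS}) - \mathrm{Avar}(\hat{\boldsymbol{\beta}}_{FGLS})$
is positive semidefinite. (Hint: Show that $[\mathrm{Avar}(\hat{\boldsymbol{\beta}}_{FGLS})]^{-1} - [\mathrm{Avar}(\hat{\boldsymbol{\beta}}_{SOLS})]^{-1}$ is
p.s.d.)

d. If, in addition to the previous assumptions, $\boldsymbol{\Omega} = \sigma^2 \mathbf{I}_G$, show that SOLS and FGLS
have the same asymptotic variance.

e. Evaluate the following statement: "Under the assumptions of part c, FGLS is
never asymptotically worse than SOLS, even if $\boldsymbol{\Omega} = \sigma^2 \mathbf{I}_G$."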


7.3. Consider the SUR model (7.2) under Assumptions SOLS.1, SOLS.2, and
SGLS.3, with $\boldsymbol{\Omega} \equiv \mathrm{diag}(\sigma_1^2, \ldots, \sigma_G^2)$; thus, GLS and OLS estimation equation by
equation are the same. (In the SUR model with diagonal $\boldsymbol{\Omega}$, Assumption SOLS.1 is
the same as Assumption SGLS.1, and Assumption SOLS.2 is the same as
Assumption SGLS.2.)


a. Show that single-equation OLS estimators from any two equations, say, $\hat{\boldsymbol{\beta}}_g$ and $\hat{\boldsymbol{\beta}}_h$,
are asymptotically uncorrelated. (That is, show that the asymptotic variance of the
system OLS estimator $\hat{\boldsymbol{\beta}}$ is block diagonal.)

b. Under the conditions of part a, assume that $\boldsymbol{\beta}_1$ and $\boldsymbol{\beta}_2$ (the parameter vectors in
the first two equations) have the same dimension. Explain how you would test
$H_0: \boldsymbol{\beta}_1 = \boldsymbol{\beta}_2$ against $H_1: \boldsymbol{\beta}_1 \neq \boldsymbol{\beta}_2$.

c. Now drop Assumption SGLS.3, maintaining Assumptions SOLS.1 and SOLS.2
and diagonality of $\boldsymbol{\Omega}$. Suppose that $\boldsymbol{\Omega}$ is estimated in an unrestricted manner, so
that FGLS and OLS are not algebraically equivalent. Show that OLS and FGLS are
$\sqrt{N}$-asymptotically equivalent, that is, $\sqrt{N}(\hat{\boldsymbol{\beta}}_{SOLS} - \hat{\boldsymbol{\beta}}_{FGLS}) = o_p(1)$. This is one case
where FGLS is consistent under Assumption SOLS.1.


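7.4. Using the $\sqrt{N}$-consistency of the system OLS estimator $\hat{\boldsymbol{\beta}}$ for $\boldsymbol{\beta}$, for $\hat{\boldsymbol{\Omega}}$ in
equation (7.40) show that

$$\mathrm{vec}[\sqrt{N}(\hat{\boldsymbol{\Omega}} - \boldsymbol{\Omega})] = \mathrm{vec}\left[ N^{-1/2} \sum_{i=1}^{N} (\mathbf{u}_i\mathbf{u}_i' - \boldsymbol{\Omega}) \right] + o_p(1)$$

under Assumptions SGLS.1 and SOLS.2. (Note: This result does not hold when
Assumption SGLS.1 is replaced with the weaker Assumption SOLS.1.) Assume that all
moment conditions needed to apply the WLLN and CLT are satisfied. The important
conclusion is that the asymptotic distribution of $\mathrm{vec}\, \sqrt{N}(\hat{\boldsymbol{\Omega}} - \boldsymbol{\Omega})$ does not
depend on that of $\sqrt{N}(\hat{\boldsymbol{\beta}} - \boldsymbol{\beta})$, and so any asymptotic tests on the elements of $\boldsymbol{\Omega}$ can
ignore the estimation of $\boldsymbol{\beta}$. [Hint: Start from equation (7.42) and use the fact that
$\sqrt{N}(\hat{\boldsymbol{\beta}} - \boldsymbol{\beta}) = O_p(1)$.]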


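7.5. Prove Theorem 7.6, using the fact that when $\mathbf{X}_i = \mathbf{I}_G \otimes \mathbf{x}_i$,

$$\sum_{i=1}^{N} \mathbf{X}_i' \hat{\boldsymbol{\Omega}}^{-1} \mathbf{X}_i = \hat{\boldsymbol{\Omega}}^{-1} \otimes \left( \sum_{i=1}^{N} \mathbf{x}_i'\mathbf{x}_i \right) \quad \text{and} \quad \sum_{i=1}^{N} \mathbf{X}_i' \hat{\boldsymbol{\Omega}}^{-1} \mathbf{y}_i = (\hat{\boldsymbol{\Omega}}^{-1} \otimes \mathbf{I}_K) \begin{pmatrix} \sum_{i=1}^{N} \mathbf{x}_i' y_{i1} \\ \vdots \\ \sum_{i=1}^{N} \mathbf{x}_i' y_{iG} \end{pmatrix}.$$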


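7.6. Start with model (7.11). Suppose you wish to impose $Q$ linear restrictions of the
form $\mathbf{R}\boldsymbol{\beta} = \mathbf{r}$, where $\mathbf{R}$ is a $Q \times K$ matrix and $\mathbf{r}$ is a $Q \times 1$ vector. Assume that $\mathbf{R}$ is
partitioned as $\mathbf{R} \equiv [\mathbf{R}_1 \mid \mathbf{R}_2]$, where $\mathbf{R}_1$ is a $Q \times Q$ nonsingular matrix and $\mathbf{R}_2$ is a
$Q \times (K - Q)$ matrix. Partition $\mathbf{X}_i$ as $\mathbf{X}_i \equiv [\mathbf{X}_{i1} \mid \mathbf{X}_{i2}]$, where $\mathbf{X}_{i1}$ is $G \times Q$ and $\mathbf{X}_{i2}$ is
$G \times (K - Q)$, and partition $\boldsymbol{\beta}$ as $\boldsymbol{\beta} \equiv (\boldsymbol{\beta}_1', \boldsymbol{\beta}_2')'$. The restrictions $\mathbf{R}\boldsymbol{\beta} = \mathbf{r}$ can be
expressed as $\mathbf{R}_1\boldsymbol{\beta}_1 + \mathbf{R}_2\boldsymbol{\beta}_2 = \mathbf{r}$, or $\boldsymbol{\beta}_1 = \mathbf{R}_1^{-1}(\mathbf{r} - \mathbf{R}_2\boldsymbol{\beta}_2)$. Show that the restricted model
can be written as

$$\tilde{\mathbf{y}}_i = \tilde{\mathbf{X}}_{i2}\boldsymbol{\beta}_2 + \mathbf{u}_i,$$

where $\tilde{\mathbf{y}}_i = \mathbf{y}_i - \mathbf{X}_{i1}\mathbf{R}_1^{-1}\mathbf{r}$ and $\tilde{\mathbf{X}}_{i2} = \mathbf{X}_{i2} - \mathbf{X}_{i1}\mathbf{R}_1^{-1}\mathbf{R}_2$.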


7.7. Consider the panel data model

$$y_{it} = \mathbf{x}_{it}\boldsymbol{\beta} + u_{it}, \qquad t = 1, 2, \ldots, T,$$

$$E(u_{it} \mid \mathbf{x}_{it}, u_{i,t-1}, \mathbf{x}_{i,t-1}, \ldots) = 0,$$

$$E(u_{it}^2 \mid \mathbf{x}_{it}) = E(u_{it}^2) = \sigma_t^2, \qquad t = 1, \ldots, T.$$


(Note that $E(u_{it}^2 \mid \mathbf{x}_{it})$ does not depend on $\mathbf{x}_{it}$, but it is allowed to be a different
constant in each time period.)

a. Show that $\boldsymbol{\Omega} = E(\mathbf{u}_i\mathbf{u}_i')$ is a diagonal matrix.

b. Write down the GLS estimator assuming that $\boldsymbol{\Omega}$ is known.


c. Argue that Assumption SGLS.1 does not necessarily hold under the assumptions
made. (Setting $\mathbf{x}_{it} = y_{i,t-1}$ might help in answering this part.) Nevertheless, show that
the GLS estimator from part b is consistent for $\boldsymbol{\beta}$ by showing that $E(\mathbf{X}_i'\boldsymbol{\Omega}^{-1}\mathbf{u}_i) = \mathbf{0}$.
(This proof shows that Assumption SGLS.1 is sufficient, but not necessary, for
consistency. Sometimes $E(\mathbf{X}_i'\boldsymbol{\Omega}^{-1}\mathbf{u}_i) = \mathbf{0}$ even though Assumption SGLS.1 does not hold.)


d. Show that Assumption SGLS.3 holds under the given assumptions.

e. Explain how to consistently estimate each $\sigma_t^2$ (as $N \to \infty$).


f. Argue that, under the assumptions made, valid inference is obtained by weighting
each observation $(y_{it}, \mathbf{x}_{it})$ by $1/\hat{\sigma}_t$ and then running pooled OLS.


g. What happens if we assume that $\sigma_t^2 = \sigma^2$ for all $t = 1, \ldots, T$?


7.8. Redo Example 7.3, disaggregating the benefits categories into value of vacation 
days, value of sick leave, value of employer-provided insurance, and value of pen- 
sion. Use hourly measures of these along with hrearn, and estimate an SUR model. 
Does marital status appear to affect any form of compensation? Test whether another 
year of education increases expected pension value and expected insurance by the 
same amount. 


7.9. Redo Example 7.7 but include a single lag of log(scrap) in the equation to 
proxy for omitted variables that may determine grant receipt. Test for AR(1) serial 
correlation. If you find it, you should also compute the fully robust standard errors 
that allow for arbitrary serial correlation across time and heteroskedasticity.


7.10. In Example 7.9, compute standard errors fully robust to serial correlation and 
heteroskedasticity. Discuss any important differences between the robust standard 
errors and the usual standard errors. 


7.11. Use the data in CORNWELL.RAW for this question; see Problem 4.13. 


a. Using the data for all seven years, and using the logarithms of all variables, esti- 
mate a model relating the crime rate to prbarr, prbconv, prbpris, avgsen, and polpc. 
Use pooled OLS and include a full set of year dummies. Test for serial correlation 
assuming that the explanatory variables are strictly exogenous. If there is serial cor- 
relation, obtain the fully robust standard errors. 


b. Add a one-year lag of log(crmrte) to the equation from part a, and compare with 
the estimates from part a. 

c. Test for first-order serial correlation in the errors in the model from part b. If serial
correlation is present, compute the fully robust standard errors.


d. Add all of the wage variables (in logarithmic form) to the equation from part c. 
Which ones are statistically and economically significant? Are they jointly significant? 
Test for joint significance of the wage variables allowing arbitrary serial correlation 
and heteroskedasticity. 


7.12. If you add wealth at the beginning of year $t$ to the saving equation in Example
7.2, is the strict exogeneity assumption likely to hold? Explain. 


7.13. Use the data in NBASAL.RAW to answer this question. 


a. Estimate an SUR model for the three response variables points, rebounds, and
assists. The explanatory variables in each equation should be age, exper, exper$^2$, coll,
guard, forward, black, and marr. Does marital status have a positive or negative effect
on each variable? Is it statistically significant in the assists equation?


b. Test the hypothesis that marital status can be excluded entirely from the system. 
You may use the test that maintains Assumption SGLS.3. 


c. What do you make of the negative, statistically significant coefficients on coll in 
the three equations? 


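7.14. Consider the system of equations $\mathbf{y}_i = \mathbf{X}_i\boldsymbol{\beta} + \mathbf{u}_i$ under $E(\mathbf{X}_i \otimes \mathbf{u}_i) = \mathbf{0}$ and 
$E(\mathbf{u}_i\mathbf{u}_i' \mid \mathbf{X}_i) = \boldsymbol{\Omega}$. Consider the FGLS estimator with $\hat{\boldsymbol{\Omega}}$ consistent for $\boldsymbol{\Omega}$ and another 
FGLS estimator using $\hat{\boldsymbol{\Lambda}} \xrightarrow{p} \boldsymbol{\Lambda} \neq \boldsymbol{\Omega}$. Show that the former is at least as asymptotically 
efficient as the latter. 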


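7.15. In the system of equations $\mathbf{y}_i = \mathbf{X}_i\boldsymbol{\beta} + \mathbf{Z}_i\boldsymbol{\gamma} + \mathbf{u}_i$ under $E(\mathbf{u}_i \mid \mathbf{X}_i, \mathbf{Z}_i) = \mathbf{0}$ and 
$\mathrm{Var}(\mathbf{u}_i \mid \mathbf{X}_i, \mathbf{Z}_i) = \boldsymbol{\Omega}$, suppose in addition that each element of $\mathbf{Z}_i$ is uncorrelated with 
each element of $\mathbf{X}_i$, $E(\mathbf{X}_i \otimes \mathbf{Z}_i) = \mathbf{0}$. (We are thinking of cases where $\mathbf{X}_i$ includes 
unity and so $\mathbf{Z}_i$ is standardized to have a zero mean.) Let $\hat{\boldsymbol{\beta}}$ be the FGLS estimator 
from the full model (obtained along with $\hat{\boldsymbol{\gamma}}$) using a consistent estimator $\hat{\boldsymbol{\Omega}}$ for $\boldsymbol{\Omega}$. Let 
$\tilde{\boldsymbol{\beta}}$ be the FGLS estimator on the restricted model $\mathbf{y}_i = \mathbf{X}_i\boldsymbol{\beta} + \mathbf{v}_i$ using a consistent 
estimator of $E(\mathbf{v}_i\mathbf{v}_i')$. Show that $\mathrm{Avar}[\sqrt{N}(\tilde{\boldsymbol{\beta}} - \boldsymbol{\beta})] - \mathrm{Avar}[\sqrt{N}(\hat{\boldsymbol{\beta}} - \boldsymbol{\beta})]$ is at least 
positive semidefinite. 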


8 System Estimation by Instrumental Variables 


8.1 Introduction and Examples 


In Chapter 7 we covered system estimation of linear equations when the explana- 
tory variables satisfy certain exogeneity conditions. For many applications, even the 
weakest of these assumptions, Assumption SOLS.1, is violated, in which case instru- 
mental variables procedures are indispensable. 

The modern approach to system instrumental variables (SIV) estimation is based 
on the principle of generalized method of moments (GMM). Method of moments 
estimation has a long history in statistics for obtaining simple parameter estimates 
when maximum likelihood estimation requires nonlinear optimization. Hansen (1982) 
and White (1982b) showed how the method of moments can be generalized to apply to 
a variety of econometric models, and they derived the asymptotic properties of GMM. 
Hansen (1982), who coined the name “generalized method of moments,” treated time 
series data, and White (1982b) assumed independently sampled observations. 

A related class of estimators falls under the heading generalized instrumental 
variables (GIV). As we will see in Section 8.4, the GIV estimator can be viewed as 
an extension of the generalized least squares (GLS) method we covered in Chapter 
7. We will also see that the GIV estimator, because of its dependence on a GLS- 
like transformation, is inconsistent in some important cases when GMM remains 
consistent. 

Though the models considered in this chapter are more general than those treated 
in Chapter 5, the derivations of asymptotic properties of system IV estimators are 
mechanically similar to the derivations in Chapters 5 and 7. Therefore, the proofs in 
this chapter will be terse, or omitted altogether. 

In econometrics, the most familiar application of SIV estimation is to a simultaneous 
equations model (SEM). We will cover SEMs specifically in Chapter 9, but it is 
useful to begin with a typical SEM example. System estimation procedures have 
applications beyond the classical simultaneous equations methods. We will also use 
the results in this chapter for the analysis of panel data models in Chapter 11. 


Example 8.1 (Labor Supply and Wage Offer Functions): Consider the following 
labor supply function representing the hours of labor supply, $h^s$, at any wage, $w$, 
faced by an individual. As usual, we express this in population form: 


$$h^s(w) = \gamma_1 w + \mathbf{z}_1\boldsymbol{\delta}_1 + u_1 \qquad (8.1)$$ 


where $\mathbf{z}_1$ is a vector of observed labor supply shifters (including such things as 
education, past experience, age, marital status, number of children, and nonlabor 
income) and $u_1$ contains unobservables affecting labor supply. The labor supply 
function can be derived from individual utility-maximizing behavior, and the notation 
in equation (8.1) is intended to emphasize that, for given $\mathbf{z}_1$ and $u_1$, a labor 
supply function gives the desired hours worked at any possible wage (w) facing the 
worker. As a practical matter, we can only observe equilibrium values of hours 
worked and hourly wage. But the counterfactual reasoning underlying equation (8.1) 
is the proper way to view labor supply. 

A wage offer function gives the hourly wage that the market will offer as a function 
of hours worked. (It could be that the wage offer does not depend on hours worked, 
but in general it might.) For observed productivity attributes $\mathbf{z}_2$ (for example, education, 
experience, and amount of job training) and unobserved attributes $u_2$, we 
write the wage offer function as 


$$w^o(h) = \gamma_2 h + \mathbf{z}_2\boldsymbol{\delta}_2 + u_2. \qquad (8.2)$$ 


Again, for given $\mathbf{z}_2$ and $u_2$, $w^o(h)$ gives the wage offer for an individual agreeing to 
work $h$ hours. 

Equations (8.1) and (8.2) explain different sides of the labor market. However, 
rarely can we assume that an individual is given an exogenous wage offer and then, 
at that wage, decides how much to work based on equation (8.1). A reasonable 
approach is to assume that observed hours and wage are such that equations (8.1) and 
(8.2) both hold. In other words, letting (h, w) denote the equilibrium values, we have 


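$$h = \gamma_1 w + \mathbf{z}_1\boldsymbol{\delta}_1 + u_1, \qquad (8.3)$$ 


$$w = \gamma_2 h + \mathbf{z}_2\boldsymbol{\delta}_2 + u_2. \qquad (8.4)$$ 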


Under weak restrictions on the parameters, these equations can be solved uniquely 
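for $(h, w)$ as functions of $\mathbf{z}_1$, $\mathbf{z}_2$, $u_1$, $u_2$, and the parameters; we consider this topic 
generally in Chapter 9. Further, if $\mathbf{z}_1$ and $\mathbf{z}_2$ are exogenous in the sense that 


$$E(u_1 \mid \mathbf{z}_1, \mathbf{z}_2) = E(u_2 \mid \mathbf{z}_1, \mathbf{z}_2) = 0,$$ 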


then, under identification assumptions, we can consistently estimate the parameters 
of the labor supply and wage offer functions. We consider identification of SEMs in 
detail in Chapter 9. We also ignore what is sometimes a practically important issue: 
the equilibrium hours for an individual might be zero, in which case w is not observed 
for such people. We deal with missing data issues in Chapter 21. 

For a random draw from the population, we can write 


$$h_i = \gamma_1 w_i + \mathbf{z}_{i1}\boldsymbol{\delta}_1 + u_{i1}, \qquad (8.5)$$ 


$$w_i = \gamma_2 h_i + \mathbf{z}_{i2}\boldsymbol{\delta}_2 + u_{i2}. \qquad (8.6)$$ 


Except under very special assumptions, $u_{i1}$ will be correlated with $w_i$, and $u_{i2}$ will be 
correlated with $h_i$. In other words, $w_i$ is probably endogenous in equation (8.5), and 
$h_i$ is probably endogenous in equation (8.6). It is for this reason that we study system 
instrumental variables methods. 
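
The endogeneity described in Example 8.1 can be illustrated with a small simulation. The sketch below is not from the text: the parameter values, the scalar shifters z1 and z2, the standard normal errors, and the function name simulate_labor_market are all assumptions chosen for illustration. It solves equations (8.3) and (8.4) for the equilibrium (h, w), then compares OLS on the labor supply equation with a simple IV estimator that uses z2 as the instrument for w.

```python
import numpy as np

def simulate_labor_market(n, gamma1=0.5, gamma2=-0.3, delta1=1.0, delta2=1.0, seed=0):
    """Draw equilibrium (h, w) that satisfy both structural equations (hypothetical design)."""
    rng = np.random.default_rng(seed)
    z1, z2, u1, u2 = rng.normal(size=(4, n))
    det = 1.0 - gamma1 * gamma2
    # Reduced forms obtained by solving (8.3) and (8.4) jointly for h and w
    h = (gamma1 * (delta2 * z2 + u2) + delta1 * z1 + u1) / det
    w = (gamma2 * (delta1 * z1 + u1) + delta2 * z2 + u2) / det
    return h, w, z1, z2

h, w, z1, z2 = simulate_labor_market(200_000)
X = np.column_stack([w, z1])    # regressors in the labor supply equation
Z = np.column_stack([z2, z1])   # z2 instruments for w; z1 serves as its own instrument
ols = np.linalg.solve(X.T @ X, X.T @ h)   # inconsistent: w is correlated with u1
iv = np.linalg.solve(Z.T @ X, Z.T @ h)    # consistent under E(u1 | z1, z2) = 0
print("OLS estimate of gamma1:", ols[0])
print("IV estimate of gamma1: ", iv[0])
```

In this design the wage shifter z2 is excluded from the labor supply equation, which is what allows it to serve as an instrument for w.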


An example with the same statistical structure as Example 8.1 but with an omitted 
variables interpretation is motivated by Currie and Thomas (1995). 


Example 8.2 (Student Performance and Head Start): Consider an equation to test 
the effect of Head Start participation on subsequent student performance: 


$$\mathit{score}_i = \gamma_1 \mathit{HeadStart}_i + \mathbf{z}_{i1}\boldsymbol{\delta}_1 + u_{i1}, \qquad (8.7)$$ 


where $\mathit{score}_i$ is the outcome on a test when the child is enrolled in school and 
$\mathit{HeadStart}_i$ is a binary indicator equal to one if child $i$ participated in Head Start at 
an early age. The vector $\mathbf{z}_{i1}$ contains other observed factors, such as income, education, 
and family background variables. The error term $u_{i1}$ contains unobserved factors 
that affect score (such as the child's ability) and that may also be correlated with 
HeadStart. To capture the possible endogeneity of HeadStart, we write a linear 
reduced form (linear projection) for $\mathit{HeadStart}_i$: 


$$\mathit{HeadStart}_i = \mathbf{z}_i\boldsymbol{\delta}_2 + u_{i2}. \qquad (8.8)$$ 


Remember, this projection always exists even though $\mathit{HeadStart}_i$ is a binary variable. 
The vector $\mathbf{z}_i$ contains $\mathbf{z}_{i1}$ and at least one factor affecting Head Start participation 
that does not have a direct effect on score. One possibility is distance to the nearest 
Head Start center. In this example we would probably be willing to assume that 
$E(u_{i1} \mid \mathbf{z}_i) = 0$, since the test score equation is structural, but we would only want to 
assume $E(\mathbf{z}_i' u_{i2}) = \mathbf{0}$, since the Head Start equation is a linear projection involving a 
binary dependent variable. Correlation between $u_{i1}$ and $u_{i2}$ means HeadStart is 
endogenous in equation (8.7). 


Both of the previous examples can be written for observation i as 
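
$$y_{i1} = \mathbf{x}_{i1}\boldsymbol{\beta}_1 + u_{i1}, \qquad (8.9)$$ 

$$y_{i2} = \mathbf{x}_{i2}\boldsymbol{\beta}_2 + u_{i2}, \qquad (8.10)$$ 


which looks just like a two-equation SUR system but where $\mathbf{x}_{i1}$ and $\mathbf{x}_{i2}$ can contain 
endogenous as well as exogenous variables. Because $\mathbf{x}_{i1}$ and $\mathbf{x}_{i2}$ are generally correlated 
with $u_{i1}$ and $u_{i2}$, estimation of these equations by OLS or FGLS, as we studied 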
in Chapter 7, will generally produce inconsistent estimators. 


We already know one method for estimating an equation such as equation (8.9): 
if we have sufficient instruments, apply 2SLS. Often 2SLS produces acceptable 
results, so why should we go beyond single-equation analysis? Not surprisingly, our 
interest in system methods with endogenous explanatory variables has to do with 
efficiency. In many cases we can obtain more efficient estimators by estimating $\boldsymbol{\beta}_1$ and 
$\boldsymbol{\beta}_2$ jointly, that is, by using a system procedure. The efficiency gains are analogous 
to the gains that can be realized by using feasible GLS rather than OLS in an SUR 
system. 


8.2 General Linear System of Equations 


We now discuss estimation of a general linear model of the form 

$$\mathbf{y}_i = \mathbf{X}_i\boldsymbol{\beta} + \mathbf{u}_i, \qquad (8.11)$$ 


where $\mathbf{y}_i$ is a $G \times 1$ vector, $\mathbf{X}_i$ is a $G \times K$ matrix, and $\mathbf{u}_i$ is the $G \times 1$ vector of errors. 
This model is identical to equation (7.9), except that we will use different assumptions. 
In writing out examples, we will often omit the observation subscript $i$, but 
for the general analysis, carrying it along is a useful notational device. As in Chapter 
7, the rows of $\mathbf{y}_i$, $\mathbf{X}_i$, and $\mathbf{u}_i$ can represent different time periods for the same 
cross-sectional unit (so $G = T$, the total number of time periods). Therefore, the following 
analysis applies to panel data models where T is small relative to the cross section 
sample size, N; for an example, see Problem 8.8. We cover general panel data appli- 
cations in Chapter 11. (As in Chapter 7, the label “systems of equations” is not es- 
pecially accurate for basic panel data models because we have only one behavioral 
equation over T different time periods.) 

The following orthogonality condition is the basis for estimating $\boldsymbol{\beta}$: 


ASSUMPTION SIV.1: $E(\mathbf{Z}_i'\mathbf{u}_i) = \mathbf{0}$, where $\mathbf{Z}_i$ is a $G \times L$ matrix of observable 
instrumental variables. 


(The abbreviation SIV stands for “system instrumental variables.”) For the purposes 
of discussion, we assume that $E(\mathbf{u}_i) = \mathbf{0}$; this assumption is almost always true in 
practice anyway. 

From what we know about IV and 2SLS for single equations, Assumption SIV.1 
cannot be enough to identify the vector $\boldsymbol{\beta}$. An assumption sufficient for identification 
is the rank condition: 


ASSUMPTION SIV.2: rank $E(\mathbf{Z}_i'\mathbf{X}_i) = K$. 


Assumption SIV.2 generalizes the rank condition from the single-equation case. 
(When $G = 1$, Assumption SIV.2 is the same as Assumption 2SLS.2b.) Since $E(\mathbf{Z}_i'\mathbf{X}_i)$ 
is an $L \times K$ matrix, Assumption SIV.2 requires the columns of this matrix to be linearly 
independent. Necessary for the rank condition is the order condition: $L \geq K$. 
We will investigate the rank condition in detail for a broad class of models in Chapter 
9. For now, we just assume that it holds. 

As in Chapter 7, it is useful to carry along two examples. The first applies to si- 
multaneous equations models and other systems with endogenous explanatory vari- 
ables. The second is a panel data model where instrumental variables are specified 
that are not the same as the explanatory variables. 

For the first example, write a $G$ equation system for the population as 


$$\begin{aligned} y_1 &= \mathbf{x}_1\boldsymbol{\beta}_1 + u_1 \\ &\;\;\vdots \\ y_G &= \mathbf{x}_G\boldsymbol{\beta}_G + u_G, \end{aligned} \qquad (8.12)$$ 


where, for each equation $g$, $\mathbf{x}_g$ is a $1 \times K_g$ vector that can contain both exogenous 
and endogenous variables. For each $g$, $\boldsymbol{\beta}_g$ is $K_g \times 1$. Because this looks just like the 
SUR system from Chapter 7, we will refer to it as an SUR system, keeping in mind 
the crucial fact that some elements of $\mathbf{x}_g$ are thought to be correlated with $u_g$ for at 
least some $g$. 

For each equation we assume that we have a set of instrumental variables, a $1 \times L_g$ 
vector $\mathbf{z}_g$, that are exogenous in the sense that 


$$E(\mathbf{z}_g' u_g) = \mathbf{0}, \qquad g = 1, 2, \ldots, G. \qquad (8.13)$$ 


In most applications, unity is an element of $\mathbf{z}_g$ for each $g$, so that $E(u_g) = 0$, all $g$. As 
we will see, and as we already know from single-equation analysis, if $\mathbf{x}_g$ contains 
some elements correlated with $u_g$, then $\mathbf{z}_g$ must contain more than just the exogenous 
variables appearing in equation $g$. Much of the time the same instruments, which 
consist of all exogenous variables appearing anywhere in the system, are valid for 
every equation, so that $\mathbf{z}_g = \mathbf{w}$ (say), $g = 1, 2, \ldots, G$. Some applications require us to 
have different instruments for different equations, so we allow that possibility here. 
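Putting an $i$ subscript on the variables in equations (8.12), and defining 


$$\underset{G \times 1}{\mathbf{y}_i} \equiv \begin{pmatrix} y_{i1} \\ y_{i2} \\ \vdots \\ y_{iG} \end{pmatrix}, \quad \underset{G \times K}{\mathbf{X}_i} \equiv \begin{pmatrix} \mathbf{x}_{i1} & \mathbf{0} & \cdots & \mathbf{0} \\ \mathbf{0} & \mathbf{x}_{i2} & \cdots & \mathbf{0} \\ \vdots & & \ddots & \vdots \\ \mathbf{0} & \mathbf{0} & \cdots & \mathbf{x}_{iG} \end{pmatrix}, \quad \underset{G \times 1}{\mathbf{u}_i} \equiv \begin{pmatrix} u_{i1} \\ u_{i2} \\ \vdots \\ u_{iG} \end{pmatrix} \qquad (8.14)$$ 


and $\boldsymbol{\beta} = (\boldsymbol{\beta}_1', \boldsymbol{\beta}_2', \ldots, \boldsymbol{\beta}_G')'$, we can write equation (8.12) in the form (8.11). Note that 
$K = K_1 + K_2 + \cdots + K_G$ is the total number of parameters in the system. 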
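The matrix of instruments has a structure similar to $\mathbf{X}_i$: 


$$\mathbf{Z}_i \equiv \begin{pmatrix} \mathbf{z}_{i1} & \mathbf{0} & \cdots & \mathbf{0} \\ \mathbf{0} & \mathbf{z}_{i2} & \cdots & \mathbf{0} \\ \vdots & & \ddots & \vdots \\ \mathbf{0} & \mathbf{0} & \cdots & \mathbf{z}_{iG} \end{pmatrix}, \qquad (8.15)$$ 


which has dimension $G \times L$, where $L = L_1 + L_2 + \cdots + L_G$. Then, for each $i$, 

$$\mathbf{Z}_i'\mathbf{u}_i = (\mathbf{z}_{i1}u_{i1}, \mathbf{z}_{i2}u_{i2}, \ldots, \mathbf{z}_{iG}u_{iG})', \qquad (8.16)$$ 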


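and so $E(\mathbf{Z}_i'\mathbf{u}_i) = \mathbf{0}$ reproduces the orthogonality conditions (8.13). Also, 


$$E(\mathbf{Z}_i'\mathbf{X}_i) = \begin{pmatrix} E(\mathbf{z}_{i1}'\mathbf{x}_{i1}) & \mathbf{0} & \cdots & \mathbf{0} \\ \mathbf{0} & E(\mathbf{z}_{i2}'\mathbf{x}_{i2}) & \cdots & \mathbf{0} \\ \vdots & & \ddots & \vdots \\ \mathbf{0} & \mathbf{0} & \cdots & E(\mathbf{z}_{iG}'\mathbf{x}_{iG}) \end{pmatrix}, \qquad (8.17)$$ 


where $E(\mathbf{z}_{ig}'\mathbf{x}_{ig})$ is $L_g \times K_g$. Assumption SIV.2 requires that this matrix have full column 
rank, where the number of columns is $K = K_1 + K_2 + \cdots + K_G$. A well-known 
result from linear algebra says that a block diagonal matrix has full column rank if 
and only if each block in the matrix has full column rank. In other words, Assumption 
SIV.2 holds in this example if and only if 


$$\mathrm{rank}\, E(\mathbf{z}_{ig}'\mathbf{x}_{ig}) = K_g, \qquad g = 1, 2, \ldots, G. \qquad (8.18)$$ 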


This is exactly the rank condition needed for estimating each equation by 2SLS, 
which we know is possible under conditions (8.13) and (8.18). Therefore, identifica- 
tion of the SUR system is equivalent to identification equation by equation. This 
reasoning assumes that the $\boldsymbol{\beta}_g$ are unrestricted across equations. If some prior 
restrictions are known, then identification is more complicated, something we cover 
explicitly in Chapter 9. 
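
The block diagonal rank result used above is easy to check numerically. The following sketch is illustrative only: the helper block_diag and the random matrices standing in for the blocks $E(\mathbf{z}_{ig}'\mathbf{x}_{ig})$ are assumptions, not part of the text. It builds one block diagonal matrix whose blocks all have full column rank and another in which one block has two identical columns.

```python
import numpy as np

def block_diag(blocks):
    """Place the given 2-D arrays along the diagonal of a zero matrix."""
    rows = sum(b.shape[0] for b in blocks)
    cols = sum(b.shape[1] for b in blocks)
    out = np.zeros((rows, cols))
    r = c = 0
    for b in blocks:
        out[r:r + b.shape[0], c:c + b.shape[1]] = b
        r += b.shape[0]
        c += b.shape[1]
    return out

rng = np.random.default_rng(0)
# Two blocks of shape L_g x K_g, each with full column rank (K_1 = 2, K_2 = 3)
full = [rng.normal(size=(3, 2)), rng.normal(size=(4, 3))]
# Second block now has two identical columns, so its rank is 2 rather than 3
deficient = [full[0], np.column_stack([np.ones(4), np.ones(4), rng.normal(size=4)])]

rank_full = np.linalg.matrix_rank(block_diag(full))       # equals K = 5
rank_def = np.linalg.matrix_rank(block_diag(deficient))   # falls short of K = 5
print(rank_full, rank_def)
```

The rank of the stacked matrix is the sum of the block ranks, so the system rank condition fails exactly when some equation fails its own rank condition.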

In the important special case where the same instruments, say w;, can be used for 
every equation, we can write definition (8.15) as $\mathbf{Z}_i = \mathbf{I}_G \otimes \mathbf{w}_i$. 

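For the panel data model 


$$y_{it} = \mathbf{x}_{it}\boldsymbol{\beta} + u_{it}, \qquad t = 1, 2, \ldots, T, \qquad (8.19)$$ 


we set $G = T$ and define the $T \times K$ matrix as in Chapter 7: 

$$\mathbf{X}_i = \begin{pmatrix} \mathbf{x}_{i1} \\ \mathbf{x}_{i2} \\ \vdots \\ \mathbf{x}_{iT} \end{pmatrix}. \qquad (8.20)$$ 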


As the model is written in (8.19), there is a common vector, f, in each time period. 
Nevertheless, as we discussed in Chapter 7, this notation allows for different intercepts 
and slopes because $\mathbf{x}_{it}$ can contain time period dummies and interactions 
between time dummies and covariates (whether those covariates change over time or 
not). 

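If $\mathbf{z}_{it}$ is a $1 \times L_t$ vector of instruments for each time period, in the sense that 


$$E(\mathbf{z}_{it}' u_{it}) = \mathbf{0}, \qquad t = 1, \ldots, T, \qquad (8.21)$$ 


then we can define the matrix of instruments as in (8.15), with the slight notational 
change $G = T$. When $L_t$ is the same for all $t$ ($L_t = L$, $t = 1, \ldots, T$), a different choice 
is possible: 


$$\mathbf{Z}_i = \begin{pmatrix} \mathbf{z}_{i1} \\ \mathbf{z}_{i2} \\ \vdots \\ \mathbf{z}_{iT} \end{pmatrix}. \qquad (8.22)$$ 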


As we will see, generally the choice (8.22) leads to a different estimator than when the 
IVs are chosen as in (8.15). 


8.3 Generalized Method of Moments Estimation 


8.3.1 General Weighting Matrix 


The orthogonality conditions in Assumption SIV.1 suggest an estimation strategy. 
Under Assumptions SIV.1 and SIV.2, $\boldsymbol{\beta}$ is the unique $K \times 1$ vector solving the linear 
set of population moment conditions 


$$E[\mathbf{Z}_i'(\mathbf{y}_i - \mathbf{X}_i\boldsymbol{\beta})] = \mathbf{0}. \qquad (8.23)$$ 


(That $\boldsymbol{\beta}$ is a solution follows from Assumption SIV.1; that it is unique follows by 
Assumption SIV.2.) In other words, if $\mathbf{b}$ is any other $K \times 1$ vector (so that at least one 
element of $\mathbf{b}$ is different from the corresponding element in $\boldsymbol{\beta}$), then 

$$E[\mathbf{Z}_i'(\mathbf{y}_i - \mathbf{X}_i\mathbf{b})] \neq \mathbf{0}. \qquad (8.24)$$ 


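This equation shows that $\boldsymbol{\beta}$ is identified. Because sample averages are consistent 
estimators of population moments, the analogy principle applied to condition (8.23) 
suggests choosing the estimator $\hat{\boldsymbol{\beta}}$ to solve 


$$N^{-1}\sum_{i=1}^{N} \mathbf{Z}_i'(\mathbf{y}_i - \mathbf{X}_i\hat{\boldsymbol{\beta}}) = \mathbf{0}. \qquad (8.25)$$ 


Equation (8.25) is a set of $L$ linear equations in the $K$ unknowns in $\hat{\boldsymbol{\beta}}$. First consider 
the case $L = K$, so that we have exactly enough IVs for the explanatory variables in 
the system. Then, if the $K \times K$ matrix $\sum_{i=1}^{N} \mathbf{Z}_i'\mathbf{X}_i$ is nonsingular, we can solve for $\hat{\boldsymbol{\beta}}$ as 


$$\hat{\boldsymbol{\beta}} = \left(N^{-1}\sum_{i=1}^{N} \mathbf{Z}_i'\mathbf{X}_i\right)^{-1}\left(N^{-1}\sum_{i=1}^{N} \mathbf{Z}_i'\mathbf{y}_i\right). \qquad (8.26)$$ 


We can write $\hat{\boldsymbol{\beta}}$ using full matrix notation as $\hat{\boldsymbol{\beta}} = (\mathbf{Z}'\mathbf{X})^{-1}\mathbf{Z}'\mathbf{Y}$, where $\mathbf{Z}$ is the $NG \times L$ 
matrix obtained by stacking $\mathbf{Z}_i$ from $i = 1, 2, \ldots, N$, $\mathbf{X}$ is the $NG \times K$ matrix 
obtained by stacking $\mathbf{X}_i$ from $i = 1, 2, \ldots, N$, and $\mathbf{Y}$ is the $NG \times 1$ vector obtained 
from stacking $\mathbf{y}_i$, $i = 1, 2, \ldots, N$. We call equation (8.26) the system IV (SIV) estimator. 
Application of the law of large numbers shows that the SIV estimator is con- 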
sistent under Assumptions SIV.1 and SIV.2. 
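
As a rough illustration of equation (8.26), the sketch below computes the SIV estimator in the just-identified case from three-dimensional arrays holding $\mathbf{Z}_i$, $\mathbf{X}_i$, and $\mathbf{y}_i$ for each $i$. The function name system_iv, the array layout, and the simulated design (in which a common shock v makes every regressor endogenous) are assumptions made for this example rather than anything specified in the text.

```python
import numpy as np

def system_iv(Z, X, y):
    """SIV estimator (8.26). Z: (N, G, L), X: (N, G, K), y: (N, G), with L == K."""
    ZX = np.einsum("igl,igk->lk", Z, X)   # sum over i of Z_i' X_i
    Zy = np.einsum("igl,ig->l", Z, y)     # sum over i of Z_i' y_i
    return np.linalg.solve(ZX, Zy)

rng = np.random.default_rng(1)
N, G, K = 5000, 2, 3
beta = np.array([1.0, -2.0, 0.5])
Z = rng.normal(size=(N, G, K))           # instruments, independent of the errors
v = rng.normal(size=(N, G))              # shock shared by regressors and errors
X = Z + v[:, :, None]                    # every regressor is correlated with u
u = v + rng.normal(size=(N, G))
y = X @ beta + u

beta_siv = system_iv(Z, X, y)
print(beta_siv)
```

Dividing both sums by N, as in (8.26), would not change the result, because the factors cancel.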

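When $L > K$, so that we have more columns in the IV matrix $\mathbf{Z}_i$ than we need for 
identification, choosing $\hat{\boldsymbol{\beta}}$ is more complicated. Except in special cases, equation 
(8.25) will not have a solution. Instead, we choose $\hat{\boldsymbol{\beta}}$ to make the vector in equation 
(8.25) as “small” as possible in the sample. One idea is to minimize the squared 
Euclidean length of the $L \times 1$ vector in equation (8.25). Dropping the $1/N$, this 
approach suggests choosing $\hat{\boldsymbol{\beta}}$ to make 


$$\left[\sum_{i=1}^{N} \mathbf{Z}_i'(\mathbf{y}_i - \mathbf{X}_i\hat{\boldsymbol{\beta}})\right]'\left[\sum_{i=1}^{N} \mathbf{Z}_i'(\mathbf{y}_i - \mathbf{X}_i\hat{\boldsymbol{\beta}})\right]$$ 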

as small as possible. While this method produces a consistent estimator under 
Assumptions SIV.1 and SIV.2, it rarely produces the best estimator, for reasons we 
will see in Section 8.3.3. 

A more general class of estimators is obtained by using a weighting matrix in the 
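quadratic form. Let $\hat{\mathbf{W}}$ be an $L \times L$ symmetric, positive semidefinite matrix, where 
the “^” is included to emphasize that $\hat{\mathbf{W}}$ is generally an estimator. A generalized 
method of moments (GMM) estimator of $\boldsymbol{\beta}$ is a vector $\hat{\boldsymbol{\beta}}$ that solves the problem 


$$\min_{\mathbf{b}} \left[\sum_{i=1}^{N} \mathbf{Z}_i'(\mathbf{y}_i - \mathbf{X}_i\mathbf{b})\right]' \hat{\mathbf{W}} \left[\sum_{i=1}^{N} \mathbf{Z}_i'(\mathbf{y}_i - \mathbf{X}_i\mathbf{b})\right]. \qquad (8.27)$$ 

Because expression (8.27) is a quadratic function of $\mathbf{b}$, the solution to it has a closed 
form. Using multivariable calculus or direct substitution, we can show that the unique 
solution is 


$$\hat{\boldsymbol{\beta}} = (\mathbf{X}'\mathbf{Z}\hat{\mathbf{W}}\mathbf{Z}'\mathbf{X})^{-1}(\mathbf{X}'\mathbf{Z}\hat{\mathbf{W}}\mathbf{Z}'\mathbf{Y}), \qquad (8.28)$$ 


assuming that $\mathbf{X}'\mathbf{Z}\hat{\mathbf{W}}\mathbf{Z}'\mathbf{X}$ is nonsingular. To show that this estimator is consistent, we 
assume that $\hat{\mathbf{W}}$ has a nonsingular probability limit. 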


ASSUMPTION SIV.3: $\hat{\mathbf{W}} \xrightarrow{p} \mathbf{W}$ as $N \to \infty$, where $\mathbf{W}$ is a nonrandom, symmetric, 
$L \times L$ positive definite matrix. 


In applications, the convergence in Assumption SIV.3 will follow from the law of 
large numbers because $\hat{\mathbf{W}}$ will be a function of sample averages. The fact that $\mathbf{W}$ is 
assumed to be positive definite means that $\hat{\mathbf{W}}$ is positive definite with probability 
approaching one (see Chapter 3). We could relax the assumption of positive defi- 
niteness to positive semidefiniteness at the cost of complicating the assumptions. In 
most applications, we can assume that W is positive definite. 


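THEOREM 8.1 (Consistency of GMM): Under Assumptions SIV.1-SIV.3, $\hat{\boldsymbol{\beta}} \xrightarrow{p} \boldsymbol{\beta}$ as 
$N \to \infty$. 

Proof: Write 


$$\hat{\boldsymbol{\beta}} = \left[\left(N^{-1}\sum_{i=1}^{N} \mathbf{X}_i'\mathbf{Z}_i\right)\hat{\mathbf{W}}\left(N^{-1}\sum_{i=1}^{N} \mathbf{Z}_i'\mathbf{X}_i\right)\right]^{-1}\left(N^{-1}\sum_{i=1}^{N} \mathbf{X}_i'\mathbf{Z}_i\right)\hat{\mathbf{W}}\left(N^{-1}\sum_{i=1}^{N} \mathbf{Z}_i'\mathbf{y}_i\right).$$ 


Plugging in $\mathbf{y}_i = \mathbf{X}_i\boldsymbol{\beta} + \mathbf{u}_i$ and doing a little algebra gives 


$$\hat{\boldsymbol{\beta}} = \boldsymbol{\beta} + \left[\left(N^{-1}\sum_{i=1}^{N} \mathbf{X}_i'\mathbf{Z}_i\right)\hat{\mathbf{W}}\left(N^{-1}\sum_{i=1}^{N} \mathbf{Z}_i'\mathbf{X}_i\right)\right]^{-1}\left(N^{-1}\sum_{i=1}^{N} \mathbf{X}_i'\mathbf{Z}_i\right)\hat{\mathbf{W}}\left(N^{-1}\sum_{i=1}^{N} \mathbf{Z}_i'\mathbf{u}_i\right).$$ 


Under Assumption SIV.2, $\mathbf{C} \equiv E(\mathbf{Z}_i'\mathbf{X}_i)$ has rank $K$, and combining this with Assumption 
SIV.3, $\mathbf{C}'\mathbf{W}\mathbf{C}$ has rank $K$ and is therefore nonsingular. It follows by the 
law of large numbers that $\mathrm{plim}\, \hat{\boldsymbol{\beta}} = \boldsymbol{\beta} + (\mathbf{C}'\mathbf{W}\mathbf{C})^{-1}\mathbf{C}'\mathbf{W}\left(\mathrm{plim}\, N^{-1}\sum_{i=1}^{N} \mathbf{Z}_i'\mathbf{u}_i\right) = \boldsymbol{\beta} + 
(\mathbf{C}'\mathbf{W}\mathbf{C})^{-1}\mathbf{C}'\mathbf{W} \cdot \mathbf{0} = \boldsymbol{\beta}$. 


Theorem 8.1 shows that a large class of estimators is consistent for $\boldsymbol{\beta}$ under Assumptions 
SIV.1 and SIV.2, provided that we choose $\hat{\mathbf{W}}$ to satisfy modest restrictions. 
When $L = K$, the GMM estimator in equation (8.28) becomes equation (8.26), no 
matter how we choose $\hat{\mathbf{W}}$, because $\mathbf{X}'\mathbf{Z}$ is a $K \times K$ nonsingular matrix. 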

We can also show that $\hat{\boldsymbol{\beta}}$ is asymptotically normally distributed under these first 
three assumptions. 


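THEOREM 8.2 (Asymptotic Normality of GMM): Under Assumptions SIV.1-SIV.3, 
$\sqrt{N}(\hat{\boldsymbol{\beta}} - \boldsymbol{\beta})$ is asymptotically normally distributed with mean zero and 


$$\mathrm{Avar}\, \sqrt{N}(\hat{\boldsymbol{\beta}} - \boldsymbol{\beta}) = (\mathbf{C}'\mathbf{W}\mathbf{C})^{-1}\mathbf{C}'\mathbf{W}\boldsymbol{\Lambda}\mathbf{W}\mathbf{C}(\mathbf{C}'\mathbf{W}\mathbf{C})^{-1}, \qquad (8.29)$$ 

where 

$$\boldsymbol{\Lambda} \equiv E(\mathbf{Z}_i'\mathbf{u}_i\mathbf{u}_i'\mathbf{Z}_i) = \mathrm{Var}(\mathbf{Z}_i'\mathbf{u}_i). \qquad (8.30)$$ 


We will not prove this theorem in detail as it can be reasoned from 


$$\sqrt{N}(\hat{\boldsymbol{\beta}} - \boldsymbol{\beta}) = \left[\left(N^{-1}\sum_{i=1}^{N} \mathbf{X}_i'\mathbf{Z}_i\right)\hat{\mathbf{W}}\left(N^{-1}\sum_{i=1}^{N} \mathbf{Z}_i'\mathbf{X}_i\right)\right]^{-1}\left(N^{-1}\sum_{i=1}^{N} \mathbf{X}_i'\mathbf{Z}_i\right)\hat{\mathbf{W}}\left(N^{-1/2}\sum_{i=1}^{N} \mathbf{Z}_i'\mathbf{u}_i\right),$$ 

where we use the fact that $N^{-1/2}\sum_{i=1}^{N} \mathbf{Z}_i'\mathbf{u}_i \xrightarrow{d} \mathrm{Normal}(\mathbf{0}, \boldsymbol{\Lambda})$. The asymptotic variance 
matrix in equation (8.29) looks complicated, but it can be consistently estimated. 
If $\hat{\boldsymbol{\Lambda}}$ is a consistent estimator of $\boldsymbol{\Lambda}$ (more on this later), then equation (8.29) 
is consistently estimated by 


$$[(\mathbf{X}'\mathbf{Z}/N)\hat{\mathbf{W}}(\mathbf{Z}'\mathbf{X}/N)]^{-1}(\mathbf{X}'\mathbf{Z}/N)\hat{\mathbf{W}}\hat{\boldsymbol{\Lambda}}\hat{\mathbf{W}}(\mathbf{Z}'\mathbf{X}/N)[(\mathbf{X}'\mathbf{Z}/N)\hat{\mathbf{W}}(\mathbf{Z}'\mathbf{X}/N)]^{-1}. \qquad (8.31)$$ 


As usual, we estimate $\mathrm{Avar}(\hat{\boldsymbol{\beta}})$ by dividing expression (8.31) by $N$. 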

While the general formula (8.31) is occasionally useful, it turns out that it is greatly 
simplified by choosing $\hat{\mathbf{W}}$ appropriately. Since this choice also (and not coincidentally) 
gives the asymptotically efficient estimator, we hold off discussing asymptotic 
variances further until we cover the optimal choice of $\hat{\mathbf{W}}$ in Section 8.3.3. 


8.3.2 System Two-Stage Least Squares Estimator 


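A choice of $\hat{\mathbf{W}}$ that leads to a useful and familiar-looking estimator is 


$$\hat{\mathbf{W}} = \left(N^{-1}\sum_{i=1}^{N} \mathbf{Z}_i'\mathbf{Z}_i\right)^{-1} = (\mathbf{Z}'\mathbf{Z}/N)^{-1}, \qquad (8.32)$$ 


which is a consistent estimator of $[E(\mathbf{Z}_i'\mathbf{Z}_i)]^{-1}$. Assumption SIV.3 simply requires that 
$E(\mathbf{Z}_i'\mathbf{Z}_i)$ exist and be nonsingular, and these requirements are not very restrictive. 
When we plug equation (8.32) into equation (8.28) and cancel $N$ everywhere, we get 

$$\hat{\boldsymbol{\beta}} = [\mathbf{X}'\mathbf{Z}(\mathbf{Z}'\mathbf{Z})^{-1}\mathbf{Z}'\mathbf{X}]^{-1}\mathbf{X}'\mathbf{Z}(\mathbf{Z}'\mathbf{Z})^{-1}\mathbf{Z}'\mathbf{Y}. \qquad (8.33)$$ 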


This looks just like the single-equation 2SLS estimator, and so we call it the system 
2SLS (S2SLS) estimator. 

When we apply equation (8.33) to the system of equations (8.12), with definitions 
(8.14) and (8.15), we get something very familiar. As an exercise, you should show 
that $\hat{\boldsymbol{\beta}}$ produces 2SLS equation by equation. (The proof relies on the block diagonal 
structures of $\mathbf{Z}_i'\mathbf{Z}_i$ and $\mathbf{Z}_i'\mathbf{X}_i$ for each $i$.) In other words, we estimate the first equation 
by 2SLS using instruments $\mathbf{z}_{i1}$, the second equation by 2SLS using instruments $\mathbf{z}_{i2}$, 
and so on. When we stack these into one long vector, we get equation (8.33). 
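
The equivalence between system 2SLS and equation-by-equation 2SLS can be verified numerically before attempting the algebra. The following sketch uses a hypothetical two-equation system with simulated data; the helper names tsls and block_diag2 and all parameter values are assumptions for illustration. For convenience the stacked matrices are ordered by equation rather than by observation, which is a common row permutation of $\mathbf{X}$, $\mathbf{Z}$, and $\mathbf{Y}$ and so leaves $\mathbf{Z}'\mathbf{Z}$, $\mathbf{Z}'\mathbf{X}$, and $\mathbf{Z}'\mathbf{Y}$ unchanged.

```python
import numpy as np

def tsls(Z, X, y):
    """2SLS as in (8.33): regress y on the projection of X onto the columns of Z."""
    Xhat = Z @ np.linalg.lstsq(Z, X, rcond=None)[0]   # first-stage fitted values
    return np.linalg.solve(Xhat.T @ X, Xhat.T @ y)

def block_diag2(A, B):
    """Block diagonal matrix built from two 2-D arrays."""
    out = np.zeros((A.shape[0] + B.shape[0], A.shape[1] + B.shape[1]))
    out[:A.shape[0], :A.shape[1]] = A
    out[A.shape[0]:, A.shape[1]:] = B
    return out

rng = np.random.default_rng(2)
N = 500
Z1 = rng.normal(size=(N, 3))                  # L_1 = 3 instruments for equation 1
Z2 = rng.normal(size=(N, 4))                  # L_2 = 4 instruments for equation 2
X1 = Z1[:, :2] + rng.normal(size=(N, 2))      # K_1 = 2 regressors
X2 = Z2[:, :3] + rng.normal(size=(N, 3))      # K_2 = 3 regressors
y1 = X1 @ np.array([1.0, -1.0]) + rng.normal(size=N)
y2 = X2 @ np.array([0.5, 2.0, -1.0]) + rng.normal(size=N)

# System 2SLS on the stacked, block diagonal data matrices
b_system = tsls(block_diag2(Z1, Z2), block_diag2(X1, X2), np.concatenate([y1, y2]))
# 2SLS applied separately to each equation, then stacked
b_single = np.concatenate([tsls(Z1, X1, y1), tsls(Z2, X2, y2)])
print(b_system, b_single)
```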

The interpretation of the system 2SLS estimator for the panel data model differs 
depending on the choice of instrument matrix. When the instrument matrix is stacked 
as in (8.22)—which means that we have the same number of IVs for each time 
period—the S2SLS estimator reduces to the pooled 2SLS (P2SLS) estimator. In 
other words, one treats the stacked panel data as a long cross section and performs 
standard 2SLS. If Z; takes the form in (8.15), the S2SLS estimator has a different 
characterization. Namely, we run a first-stage regression separately for each $t$, $\mathbf{x}_{it}$ on 
$\mathbf{z}_{it}$, $i = 1, \ldots, N$, and obtain the fitted values, $\hat{\mathbf{x}}_{it} = \mathbf{z}_{it}\hat{\boldsymbol{\Pi}}_t$, where $\hat{\boldsymbol{\Pi}}_t$ is $L_t \times K$. Then 
we obtain the pooled IV estimator using $\hat{\mathbf{x}}_{it}$ (which is $1 \times K$) as the IVs for $\mathbf{x}_{it}$. The 
key difference in the two approaches is that the P2SLS estimator pools across time in 
estimating the reduced form, while the second procedure estimates a separate reduced 
form for each $t$. See Problem 8.8 for verification of these claims, or see Wooldridge 
(2005e). 

In the next subsection we will see that the system 2SLS estimator is not necessarily 
the asymptotically efficient estimator. Still, it is $\sqrt{N}$-consistent and easy to compute 
given the data matrices $\mathbf{X}$, $\mathbf{Y}$, and $\mathbf{Z}$. This latter feature is important because we need 
a preliminary estimator of $\boldsymbol{\beta}$ to obtain the asymptotically efficient estimator. 


8.3.3 Optimal Weighting Matrix 


Given that a GMM estimator exists for any positive definite weighting matrix, it is 
important to have a way of choosing among all of the possibilities. It turns out that 
there is a choice of W that produces the GMM estimator with the smallest asymp- 
totic variance. 

We can appeal to expression (8.29) for a hint as to the optimal choice of W. It is 
this expression we are trying to make as small as possible, in the matrix sense. (See 
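Definition 3.11 for the definition of relative asymptotic efficiency.) The expression 
(8.29) simplifies to $(\mathbf{C}'\boldsymbol{\Lambda}^{-1}\mathbf{C})^{-1}$ if we set $\mathbf{W} = \boldsymbol{\Lambda}^{-1}$. Using standard arguments from 
matrix algebra, it can be shown that $(\mathbf{C}'\mathbf{W}\mathbf{C})^{-1}\mathbf{C}'\mathbf{W}\boldsymbol{\Lambda}\mathbf{W}\mathbf{C}(\mathbf{C}'\mathbf{W}\mathbf{C})^{-1} - (\mathbf{C}'\boldsymbol{\Lambda}^{-1}\mathbf{C})^{-1}$ 
is positive semidefinite for any $L \times L$ positive definite matrix $\mathbf{W}$. The easiest way to 
prove this point is to show that 


$$(\mathbf{C}'\boldsymbol{\Lambda}^{-1}\mathbf{C}) - (\mathbf{C}'\mathbf{W}\mathbf{C})(\mathbf{C}'\mathbf{W}\boldsymbol{\Lambda}\mathbf{W}\mathbf{C})^{-1}(\mathbf{C}'\mathbf{W}\mathbf{C}) \qquad (8.34)$$ 

is positive semidefinite, and we leave this proof as an exercise (see Problem 8.5). This 
discussion motivates the following assumption and theorem. 

ASSUMPTION SIV.4: $\mathbf{W} = \boldsymbol{\Lambda}^{-1}$, where $\boldsymbol{\Lambda}$ is defined by expression (8.30). 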


THEOREM 8.3 (Optimal Weighting Matrix): Under Assumptions SIV.1—SIV.4, the 
resulting GMM estimator is efficient among all GMM estimators of the form (8.28). 
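
The simplification of the asymptotic variance under the optimal weighting matrix takes only one line. Substituting $\mathbf{W} = \boldsymbol{\Lambda}^{-1}$ into the sandwich form in (8.29), with $\mathbf{C} \equiv E(\mathbf{Z}_i'\mathbf{X}_i)$ and $\boldsymbol{\Lambda}$ as in (8.30), gives

$$(\mathbf{C}'\boldsymbol{\Lambda}^{-1}\mathbf{C})^{-1}\mathbf{C}'\boldsymbol{\Lambda}^{-1}\boldsymbol{\Lambda}\boldsymbol{\Lambda}^{-1}\mathbf{C}(\mathbf{C}'\boldsymbol{\Lambda}^{-1}\mathbf{C})^{-1} = (\mathbf{C}'\boldsymbol{\Lambda}^{-1}\mathbf{C})^{-1}(\mathbf{C}'\boldsymbol{\Lambda}^{-1}\mathbf{C})(\mathbf{C}'\boldsymbol{\Lambda}^{-1}\mathbf{C})^{-1} = (\mathbf{C}'\boldsymbol{\Lambda}^{-1}\mathbf{C})^{-1},$$

because the middle factor $\boldsymbol{\Lambda}^{-1}\boldsymbol{\Lambda}\boldsymbol{\Lambda}^{-1}$ collapses to $\boldsymbol{\Lambda}^{-1}$.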


Provided that we can consistently estimate $\boldsymbol{\Lambda}$, we can obtain the asymptotically efficient 
GMM estimator. Any consistent estimator of $\boldsymbol{\Lambda}$ delivers the efficient GMM estimator, 
but one estimator is commonly used that imposes no structure on $\boldsymbol{\Lambda}$. 


Procedure 8.1 (GMM with Optimal Weighting Matrix): 


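a. Let $\hat{\hat{\boldsymbol{\beta}}}$ be an initial consistent estimator of $\boldsymbol{\beta}$. In most cases this is the system 2SLS 
estimator. 


b. Obtain the $G \times 1$ residual vectors 

$$\hat{\hat{\mathbf{u}}}_i = \mathbf{y}_i - \mathbf{X}_i\hat{\hat{\boldsymbol{\beta}}}, \qquad i = 1, 2, \ldots, N. \qquad (8.35)$$ 


c. A generally consistent estimator of $\boldsymbol{\Lambda}$ is $\hat{\boldsymbol{\Lambda}} = N^{-1}\sum_{i=1}^{N} \mathbf{Z}_i'\hat{\hat{\mathbf{u}}}_i\hat{\hat{\mathbf{u}}}_i'\mathbf{Z}_i$. 


d. Choose 


$$\hat{\mathbf{W}} \equiv \hat{\boldsymbol{\Lambda}}^{-1} = \left(N^{-1}\sum_{i=1}^{N} \mathbf{Z}_i'\hat{\hat{\mathbf{u}}}_i\hat{\hat{\mathbf{u}}}_i'\mathbf{Z}_i\right)^{-1} \qquad (8.36)$$ 
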

and use this matrix to obtain the asymptotically optimal GMM estimator. 
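
The four steps of Procedure 8.1 translate fairly directly into array operations. The sketch below is one possible rendering, not a definitive implementation: the function names, the (N, G, ·) array layout, and the simulated heteroskedastic design are assumptions made for illustration. The initial estimator in step a is taken to be system 2SLS, as the procedure suggests is typical.

```python
import numpy as np

def gmm_closed_form(ZX, Zy, W):
    """Closed-form GMM solution (8.28) from Z'X, Z'Y, and a weighting matrix."""
    return np.linalg.solve(ZX.T @ W @ ZX, ZX.T @ W @ Zy)

def optimal_gmm(Z, X, y):
    """Two-step GMM following Procedure 8.1. Z: (N, G, L), X: (N, G, K), y: (N, G)."""
    N = Z.shape[0]
    ZX = np.einsum("igl,igk->lk", Z, X)
    Zy = np.einsum("igl,ig->l", Z, y)
    ZZ = np.einsum("igl,igm->lm", Z, Z)
    # Step a: initial consistent estimator (system 2SLS, weighting matrix (8.32))
    b_init = gmm_closed_form(ZX, Zy, np.linalg.inv(ZZ / N))
    # Step b: G x 1 residual vector for each i, stored as the rows of u_init
    u_init = y - X @ b_init
    # Step c: Lambda_hat = N^{-1} sum_i Z_i' u_i u_i' Z_i
    Zu = np.einsum("igl,ig->il", Z, u_init)   # row i holds (Z_i' u_i)'
    Lam_hat = Zu.T @ Zu / N
    # Step d: optimal weighting matrix (8.36), then re-estimate
    W_hat = np.linalg.inv(Lam_hat)
    return gmm_closed_form(ZX, Zy, W_hat), W_hat

rng = np.random.default_rng(5)
N, G, L = 4000, 2, 4
beta = np.array([1.0, -0.5])
Z = rng.normal(size=(N, G, L))
v = rng.normal(size=(N, G))
X = Z[:, :, :2] + 0.5 * Z[:, :, 2:] + v[:, :, None]             # K = 2 endogenous regressors
u = v + (1.0 + np.abs(Z[:, :, 0])) * rng.normal(size=(N, G))    # heteroskedastic errors
y = X @ beta + u

b_opt, W_hat = optimal_gmm(Z, X, y)
print(b_opt)
```

Because step c imposes no structure on the variance of $\mathbf{Z}_i'\mathbf{u}_i$, the same code applies whether the rows of each unit's data represent different equations or different time periods.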


The estimator of $\boldsymbol{\Lambda}$ in part c of Procedure 8.1 is consistent for $E(\mathbf{Z}_i'\mathbf{u}_i\mathbf{u}_i'\mathbf{Z}_i)$ under 
general conditions. When each row of $\mathbf{Z}_i$ and $\mathbf{u}_i$ represent different time periods (so 
that we have a single-equation panel data model), the estimator $\hat{\boldsymbol{\Lambda}}$ allows for arbitrary 
heteroskedasticity (conditional or unconditional), as well as arbitrary serial dependence 
(conditional or unconditional). The reason we can allow this generality 
is that we fix the row dimension of $\mathbf{Z}_i$ and $\mathbf{u}_i$ and let $N \to \infty$. Therefore, we are 
assuming that $N$, the size of the cross section, is large enough relative to $T$ to make 
fixed $T$ asymptotics sensible. (This is the same approach we took in Chapter 7.) With 
$N$ very large relative to $T$, there is no need to downweight correlations between time 
periods that are far apart, as in the Newey and West (1987) estimator applied to time 
series problems. Ziliak and Kniesner (1998) do use a Newey-West type procedure in a 
panel data application with large N. Theoretically, this is not required, and it is not 
completely general because it assumes that the underlying time series are weakly de- 
pendent. (See Wooldridge (1994a) for discussion of weak dependence in time series 
contexts.) A Newey-West type estimator might improve the finite-sample perfor- 
mance of the GMM estimator. 
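
The asymptotic variance of the optimal GMM estimator is estimated as 


$$\left[(\mathbf{X}'\mathbf{Z})\left(\sum_{i=1}^{N} \mathbf{Z}_i'\hat{\mathbf{u}}_i\hat{\mathbf{u}}_i'\mathbf{Z}_i\right)^{-1}(\mathbf{Z}'\mathbf{X})\right]^{-1}, \qquad (8.37)$$ 


where $\hat{\mathbf{u}}_i \equiv \mathbf{y}_i - \mathbf{X}_i\hat{\boldsymbol{\beta}}$; asymptotically, it makes no difference whether the first-stage 
residuals $\hat{\hat{\mathbf{u}}}_i$ are used in place of $\hat{\mathbf{u}}_i$. The square roots of diagonal elements of this 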
matrix are the asymptotic standard errors of the optimal GMM estimator. This esti- 
mator is called a minimum chi-square estimator, for reasons that will become clear in 
Section 8.5.2. 

When $\mathbf{Z}_i = \mathbf{X}_i$ and the $\hat{\mathbf{u}}_i$ are the system OLS residuals, expression (8.37) becomes 
the robust variance matrix estimator for SOLS [see expression (7.28)]. This expression 
reduces to the robust variance matrix estimator for FGLS when $\mathbf{Z}_i = \hat{\boldsymbol{\Omega}}^{-1}\mathbf{X}_i$ and 
the $\hat{\mathbf{u}}_i$ are the FGLS residuals [see equation (7.52)]. 
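
The first of these special cases can be confirmed with a few lines of matrix algebra on simulated data. In the sketch below (the data design and variable names are assumptions for illustration), setting $\mathbf{Z}_i = \mathbf{X}_i$ in expression (8.37) and inverting gives the same matrix as the familiar sandwich form $(\mathbf{X}'\mathbf{X})^{-1}\left(\sum_i \mathbf{X}_i'\hat{\mathbf{u}}_i\hat{\mathbf{u}}_i'\mathbf{X}_i\right)(\mathbf{X}'\mathbf{X})^{-1}$, which is how the robust SOLS variance estimator is usually written.

```python
import numpy as np

rng = np.random.default_rng(4)
N, G, K = 300, 3, 2
X = rng.normal(size=(N, G, K))
u = rng.normal(size=(N, G)) * (1.0 + np.abs(X[:, :, 0]))   # heteroskedastic errors
y = X @ np.array([1.0, -1.0]) + u

XX = np.einsum("igk,igm->km", X, X)                  # X'X, summed over i and g
b_sols = np.linalg.solve(XX, np.einsum("igk,ig->k", X, y))   # system OLS
uhat = y - X @ b_sols
Xu = np.einsum("igk,ig->ik", X, uhat)                # row i holds (X_i' u_i)'
S = Xu.T @ Xu                                        # sum_i X_i' u_i u_i' X_i

# Expression (8.37) with Z_i = X_i
avar_gmm_form = np.linalg.inv(XX @ np.linalg.inv(S) @ XX)
# Sandwich form of the robust variance estimator
avar_sandwich = np.linalg.inv(XX) @ S @ np.linalg.inv(XX)
robust_se = np.sqrt(np.diag(avar_gmm_form))          # asymptotic standard errors
print(robust_se)
```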


8.3.4 The Generalized Method of Moments Three-Stage Least Squares Estimator 


The GMM estimator using weighting matrix (8.36) places no restrictions on either 
the unconditional or conditional (on $\mathbf{Z}_i$) variance matrix of $\mathbf{u}_i$: we can obtain the 
asymptotically efficient estimator without making additional assumptions. Nevertheless, 
it is still common, especially in traditional simultaneous equations analysis, to 
assume that the conditional variance matrix of $\mathbf{u}_i$ given $\mathbf{Z}_i$ is constant. This assumption 
leads to a system estimator that is a middle ground between system 2SLS and the 
always-efficient minimum chi-square estimator. 

The GMM three-stage least squares (GMM 3SLS) estimator (or just 3SLS when 
the context is clear) is a GMM estimator that uses a particular weighting matrix. To 
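define the 3SLS estimator, let $\hat{\hat{\mathbf{u}}}_i = \mathbf{y}_i - \mathbf{X}_i\hat{\hat{\boldsymbol{\beta}}}$ be the residuals from an initial estimation, 
usually system 2SLS. Define the $G \times G$ matrix 


$$\hat{\boldsymbol{\Omega}} \equiv N^{-1}\sum_{i=1}^{N} \hat{\hat{\mathbf{u}}}_i\hat{\hat{\mathbf{u}}}_i'. \qquad (8.38)$$ 

Using the same arguments as in the FGLS case in Section 7.5.1, $\hat{\boldsymbol{\Omega}} \xrightarrow{p} \boldsymbol{\Omega} = E(\mathbf{u}_i\mathbf{u}_i')$. 
The weighting matrix used by 3SLS is 


$$\hat{\mathbf{W}} = \left(N^{-1}\sum_{i=1}^{N} \mathbf{Z}_i'\hat{\boldsymbol{\Omega}}\mathbf{Z}_i\right)^{-1} = [\mathbf{Z}'(\mathbf{I}_N \otimes \hat{\boldsymbol{\Omega}})\mathbf{Z}/N]^{-1}, \qquad (8.39)$$ 


where $\mathbf{I}_N$ is the $N \times N$ identity matrix. Plugging this into equation (8.28) gives the 
3SLS estimator 


$$\hat{\boldsymbol{\beta}} = [\mathbf{X}'\mathbf{Z}\{\mathbf{Z}'(\mathbf{I}_N \otimes \hat{\boldsymbol{\Omega}})\mathbf{Z}\}^{-1}\mathbf{Z}'\mathbf{X}]^{-1}\mathbf{X}'\mathbf{Z}\{\mathbf{Z}'(\mathbf{I}_N \otimes \hat{\boldsymbol{\Omega}})\mathbf{Z}\}^{-1}\mathbf{Z}'\mathbf{Y}. \qquad (8.40)$$ 


By Theorems 8.1 and 8.2, $\hat{\boldsymbol{\beta}}$ is consistent and asymptotically normal under Assumptions 
SIV.1-SIV.3. Assumption SIV.3 requires $E(\mathbf{Z}_i'\boldsymbol{\Omega}\mathbf{Z}_i)$ to be nonsingular, a standard 
assumption. 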

The term “three-stage least squares” is used here for historical reasons, and it will 
make more sense in Section 8.4. Generally, the GMM 3SLS estimator is simply given 
by equation (8.40); its computation does not require three stages. 
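
A compact way to see that no literal third stage is needed is to code equations (8.38) through (8.40) directly. The sketch below is illustrative only: the function name three_sls, the array layout, and the simulated data are assumptions. It also checks numerically that the sum form and the Kronecker form of the weighting matrix in (8.39) agree.

```python
import numpy as np

def three_sls(Z, X, y):
    """GMM 3SLS via (8.38)-(8.40). Z: (N, G, L), X: (N, G, K), y: (N, G)."""
    N = Z.shape[0]
    ZX = np.einsum("igl,igk->lk", Z, X)
    Zy = np.einsum("igl,ig->l", Z, y)
    ZZ = np.einsum("igl,igm->lm", Z, Z)
    W0 = np.linalg.inv(ZZ / N)
    b_init = np.linalg.solve(ZX.T @ W0 @ ZX, ZX.T @ W0 @ Zy)   # initial system 2SLS
    u_init = y - X @ b_init
    Omega_hat = u_init.T @ u_init / N                          # (8.38), G x G
    ZOZ = np.einsum("igl,gh,ihm->lm", Z, Omega_hat, Z)         # sum_i Z_i' Omega_hat Z_i
    W = np.linalg.inv(ZOZ / N)                                 # (8.39)
    b = np.linalg.solve(ZX.T @ W @ ZX, ZX.T @ W @ Zy)          # (8.40)
    return b, Omega_hat, ZOZ

rng = np.random.default_rng(6)
N, G, L = 400, 2, 3
Z = rng.normal(size=(N, G, L))
v = rng.normal(size=(N, G))
X = Z[:, :, :2] + 0.5 * Z[:, :, 2:] + v[:, :, None]   # K = 2; third instrument also drives X
y = X @ np.array([1.0, 2.0]) + v + rng.normal(size=(N, G))

b_3sls, Omega_hat, ZOZ = three_sls(Z, X, y)

# Kronecker form: stack the Z_i into an NG x L matrix and use I_N (x) Omega_hat
Z_stacked = Z.reshape(N * G, L)
ZOZ_kron = Z_stacked.T @ np.kron(np.eye(N), Omega_hat) @ Z_stacked
print(b_3sls)
```

The sum form avoids building the $NG \times NG$ Kronecker product, which matters once $N$ is large.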

Because the 3SLS estimator is simply a GMM estimator with a particular weight- 
ing matrix, its consistency follows from Theorem 8.1 without additional assumptions. 
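(And the 3SLS estimator is also $\sqrt{N}$-asymptotically normal.) But a natural question 
to ask is, When is 3SLS an efficient GMM estimator? The answer is simple. 
First, note that equation (8.39) always consistently estimates $[E(\mathbf{Z}_i'\boldsymbol{\Omega}\mathbf{Z}_i)]^{-1}$. Therefore, 
from Theorem 8.3, equation (8.39) is an efficient weighting matrix provided 
$E(\mathbf{Z}_i'\boldsymbol{\Omega}\mathbf{Z}_i) = \boldsymbol{\Lambda} = E(\mathbf{Z}_i'\mathbf{u}_i\mathbf{u}_i'\mathbf{Z}_i)$. 


ASSUMPTION SIV.5: $E(\mathbf{Z}_i'\mathbf{u}_i\mathbf{u}_i'\mathbf{Z}_i) = E(\mathbf{Z}_i'\boldsymbol{\Omega}\mathbf{Z}_i)$, where $\boldsymbol{\Omega} \equiv E(\mathbf{u}_i\mathbf{u}_i')$. 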


Assumption SIV.5 is the system extension of the homoskedasticity assumption for 
2SLS estimation of a single equation. A sufficient condition for Assumption SIV.5, 
and one that is easier to interpret, is 


$$E(\mathbf{u}_i\mathbf{u}_i' \mid \mathbf{Z}_i) = E(\mathbf{u}_i\mathbf{u}_i'). \qquad (8.41)$$ 


We do not take equation (8.41) as the homoskedasticity assumption because there are 
interesting applications where Assumption SIV.5 holds but equation (8.41) does not 
(more on this topic in Chapters 9 and 11). When 


$$E(\mathbf{u}_i \mid \mathbf{Z}_i) = \mathbf{0} \qquad (8.42)$$ 


is assumed in place of Assumption SIV.1, then equation (8.41) is equivalent to 
$\mathrm{Var}(\mathbf{u}_i \mid \mathbf{Z}_i) = \mathrm{Var}(\mathbf{u}_i)$. Whether we state the assumption as in equation (8.41) or use 
the weaker form, Assumption SIV.5, it is important to see that the elements of the 
unconditional variance matrix $\boldsymbol{\Omega}$ are not restricted: $\sigma_g^2 \equiv \mathrm{Var}(u_g)$ can change across 
$g$, and $\sigma_{gh} \equiv \mathrm{Cov}(u_g, u_h)$ can differ across $g$ and $h$. 

The system homoskedasticity assumption (8.41) necessarily holds when the instruments 
$\mathbf{Z}_i$ are treated as nonrandom and $\mathrm{Var}(\mathbf{u}_i)$ is constant across $i$. Because we are 
assuming random sampling, we are forced to properly focus attention on the variance 
of $\mathbf{u}_i$ conditional on $\mathbf{Z}_i$. 

For the system of equations (8.12) with instruments defined in the matrix (8.15), 
Assumption SIV.5 reduces to (without the $i$ subscript)


$E(u_gu_hz_g'z_h) = E(u_gu_h)E(z_g'z_h), \quad g, h = 1, 2, \ldots, G.$ (8.43)

Therefore, $u_gu_h$ must be uncorrelated with each of the elements of $z_g'z_h$. When $g = h$,
assumption (8.43) becomes


$E(u_g^2z_g'z_g) = E(u_g^2)E(z_g'z_g),$ (8.44)


so that $u_g^2$ is uncorrelated with each element of $z_g$, along with the squares and cross
products of the $z_g$ elements. This is exactly the homoskedasticity assumption for
single-equation IV analysis (Assumption 2SLS.3). For $g \neq h$, assumption (8.43) is
new because it involves covariances across different equations.

Assumption SIV.5 implies that Assumption SIV.4 holds (because the matrix (8.39)
consistently estimates $\Lambda^{-1}$ under Assumption SIV.5). Therefore, we have the
following theorem:


THEOREM 8.4 (Optimality of 3SLS): Under Assumptions SIV.1, SIV.2, SIV.3, and
SIV.5, the 3SLS estimator is an optimal GMM estimator. Further, the appropriate
estimator of Avar($\hat\beta$) is


$\left[(X'Z)\left(\sum_{i=1}^N Z_i'\hat\Omega Z_i\right)^{-1}(Z'X)\right]^{-1} = [X'Z\{Z'(I_N \otimes \hat\Omega)Z\}^{-1}Z'X]^{-1}.$ (8.45)


It is important to understand the implications of this theorem. First, without As- 
sumption SIV.5, the 3SLS estimator is generally less efficient, asymptotically, than 
the minimum chi-square estimator, and the asymptotic variance estimator for 3SLS 
in equation (8.45) is inappropriate. Second, even with Assumption SIV.5, the 3SLS 
estimator is no more asymptotically efficient than the minimum chi-square estimator: 
expressions (8.36) and (8.39) are both consistent estimators of $\Lambda^{-1}$ under Assumption
SIV.5. In other words, the estimators based on these two different choices for $\hat W$ are
$\sqrt{N}$-equivalent under Assumption SIV.5.
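
To make the steps behind Theorem 8.4 concrete, the following is a minimal NumPy sketch of GMM 3SLS and the variance estimator in (8.45). The array layout (`y` of shape N x G, `X` of shape N x G x K, `Z` of shape N x G x L), the function names, and the simulated two-equation system are all assumptions made for illustration; they are not part of the text.

```python
import numpy as np

def system_2sls(y, X, Z):
    """System 2SLS: GMM with weighting matrix proportional to (sum_i Z_i'Z_i)^{-1}."""
    ZX = np.einsum('igl,igk->lk', Z, X)      # sum_i Z_i'X_i, L x K
    Zy = np.einsum('igl,ig->l', Z, y)        # sum_i Z_i'y_i, length L
    ZZ = np.einsum('igl,igm->lm', Z, Z)      # sum_i Z_i'Z_i, L x L
    A = ZX.T @ np.linalg.solve(ZZ, ZX)
    return np.linalg.solve(A, ZX.T @ np.linalg.solve(ZZ, Zy))

def gmm_3sls(y, X, Z):
    """GMM 3SLS and the variance estimator in (8.45)."""
    N = y.shape[0]
    resid = y - np.einsum('igk,k->ig', X, system_2sls(y, X, Z))
    omega = resid.T @ resid / N                       # G x G estimate of Omega
    ZX = np.einsum('igl,igk->lk', Z, X)
    Zy = np.einsum('igl,ig->l', Z, y)
    ZOZ = np.einsum('igl,gh,ihm->lm', Z, omega, Z)    # sum_i Z_i' Omega_hat Z_i
    A = ZX.T @ np.linalg.solve(ZOZ, ZX)
    beta = np.linalg.solve(A, ZX.T @ np.linalg.solve(ZOZ, Zy))
    return beta, np.linalg.inv(A), omega

# Simulated system (hypothetical design): homoskedastic errors, so SIV.5 holds.
rng = np.random.default_rng(0)
N, G = 4000, 2
w = rng.normal(size=(N, 3))                           # exogenous variables
e = rng.normal(size=(N, G))
u = e + 0.5 * e[:, ::-1]                              # errors correlated across equations
x = w.sum(axis=1, keepdims=True) + e + rng.normal(size=(N, G))   # endogenous regressors

X = np.zeros((N, G, 4)); Z = np.zeros((N, G, 6))
X[:, 0, 0] = 1.0; X[:, 0, 1] = x[:, 0]                # equation 1: (1, x_1)
X[:, 1, 2] = 1.0; X[:, 1, 3] = x[:, 1]                # equation 2: (1, x_2)
Z[:, 0, 0] = 1.0; Z[:, 0, 1:3] = w[:, :2]             # z_1 = (1, w_1, w_2)
Z[:, 1, 3] = 1.0; Z[:, 1, 4:6] = w[:, 1:]             # z_2 = (1, w_2, w_3)
beta_true = np.array([1.0, 2.0, -1.0, 0.5])
y = np.einsum('igk,k->ig', X, beta_true) + u

beta_hat, avar_hat, omega_hat = gmm_3sls(y, X, Z)
std_errors = np.sqrt(np.diag(avar_hat))
```

The standard errors computed this way are appropriate only under Assumption SIV.5; without it, the robust form associated with expression (8.36) would be the safer choice.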

Given the fact that the GMM estimator using expression (8.36) as the weighting
matrix is never worse, asymptotically, than 3SLS, and in some important cases is 
strictly better, why is 3SLS ever used? There are at least two reasons. First, 3SLS has 
a long history in simultaneous equations models, whereas the GMM approach has 
been around only since the early 1980s, starting with the work of Hansen (1982) and 
White (1982b). Second, the 3SLS estimator might have better finite-sample properties 
than the optimal GMM estimator when Assumption SIV.5 holds. However, whether 
it does or not must be determined on a case-by-case basis. 

There is an interesting corollary to Theorem 8.4. Suppose that in the system (8.11)
we can assume $E(X_i \otimes u_i) = 0$, which is a strict form of exogeneity from Chapter 7.
We can use a method of moments approach to estimating $\beta$, where the instruments
for each equation, $x_i^o$, form the row vector containing every row of $X_i$. As shown by Im,
Ahn, Schmidt, and Wooldridge (1999), the GMM 3SLS estimator using instruments
$Z_i \equiv I_G \otimes x_i^o$ is equal to the feasible GLS estimator that uses the same $\hat\Omega$. Therefore,
if Assumption SIV.5 holds with $Z_i \equiv I_G \otimes x_i^o$, FGLS is asymptotically efficient in the
class of GMM estimators that use the orthogonality condition in Assumption
SGLS.1. Sufficient for Assumption SIV.5 in the GLS context is the homoskedasticity
assumption $E(u_iu_i' \mid X_i) = \Omega$.


8.4 Generalized Instrumental Variables Estimator 


In this section we study a system IV estimator that is based on transforming the 
original set of moment conditions using a GLS-type transformation. As we will see 
in Section 8.6, such an approach can lead to an asymptotically efficient estimator. 
The cost is that, in some cases, the transformation may lead to moment conditions 
that are no longer valid, leading to inconsistency. 


8.4.1 Derivation of the Generalized Instrumental Variables Estimator and Its 
Asymptotic Properties 


Rather than estimating $\beta$ using the moment conditions $E[Z_i'(y_i - X_i\beta)] = 0$, an
alternative is to transform the moment conditions in a way analogous to generalized
least squares. Typically, a GLS-like transformation is used in the context of a system
homoskedasticity assumption such as (8.41), but we can analyze the properties of
the estimator without any assumptions other than existence of the second moments.
We simply let $\Omega = E(u_iu_i')$ as before, and we assume $\Omega$ is positive definite. Then we
transform equation (8.11), just as with a GLS analysis:


$\Omega^{-1/2}y_i = \Omega^{-1/2}X_i\beta + \Omega^{-1/2}u_i.$ (8.46)

Because we think some elements of $X_i$ are endogenous, we apply system instrumental
variables to equation (8.46). In particular, we use the system 2SLS estimator with
instruments $\Omega^{-1/2}Z_i$; that is, we transform the instruments in the same way that we
transform all other variables in (8.46). The resulting estimator can be written out with
full data matrices as


$\hat\beta_{GIV} = \{[X'(I_N \otimes \Omega^{-1})Z][Z'(I_N \otimes \Omega^{-1})Z]^{-1}[Z'(I_N \otimes \Omega^{-1})X]\}^{-1}$
$\qquad \cdot [X'(I_N \otimes \Omega^{-1})Z][Z'(I_N \otimes \Omega^{-1})Z]^{-1}[Z'(I_N \otimes \Omega^{-1})Y],$ (8.47)


where, for example, $Z'(I_N \otimes \Omega^{-1})X = \sum_{i=1}^N Z_i'\Omega^{-1}X_i$. If we plug in $Y = X\beta + U$ and
divide the last term by the sample size, then we see that the condition needed
for consistency of $\hat\beta_{GIV}$ is $N^{-1}\sum_{i=1}^N Z_i'\Omega^{-1}u_i \overset{p}{\to} 0$. For completeness, a set of conditions
for consistency and asymptotic normality of the GIV estimator is listed. The first
assumption is the key orthogonality condition:


ASSUMPTION GIV.1: $E(Z_i \otimes u_i) = 0$.


Assumption GIV.1 requires that every element of the instrument matrix is
uncorrelated with every element of the vector of errors. When $Z_i = X_i$, Assumptions
GIV.1 and SGLS.1 are identical. For consistency, we can relax Assumption GIV.1 to


$E(Z_i'\Omega^{-1}u_i) = 0,$ (8.48)


provided we know $\Omega$ or use a consistent estimator of $\Omega$. There are two reasons
we adopt the stronger Assumption GIV.1. First, Assumption GIV.1 ensures that
replacing $\Omega$ with a consistent estimator does not affect the $\sqrt{N}$-asymptotic
distribution of the GIV estimator. (Essentially the same issue arises with feasible GLS.)
Second, Assumption GIV.1 implies that using an inconsistent estimator of $\Omega$ (for
example, if we incorrectly impose restrictions on $\Omega$) does not cause inconsistency in
the GIV estimator. Adopting equation (8.48) would require us to provide separate
arguments for asymptotic normality in the realistic case that $\Omega$ is replaced with $\hat\Omega$,
and even for consistency if we allow for inconsistent estimators of $\Omega$.

An important case where equation (8.48) holds but Assumption GIV.1 might not is
when $\Omega$ (or, at least, $\hat\Omega$) is diagonal, in which case we should focus on equation (8.48).
See, for example, Problem 8.8. (Because $\Omega$ is diagonal, we do not have to adjust
the asymptotic variance of the GIV estimator for the estimation of $\hat\Omega$.) Because
Assumption GIV.1 allows for a simpler unified treatment, and because it is often
implicit in applications of GIV, we adopt it here.

Naturally, we also need a rank condition: 


ASSUMPTION GIV.2: (a) rank $E(Z_i'\Omega^{-1}Z_i) = L$; (b) rank $E(Z_i'\Omega^{-1}X_i) = K$.




When G=1, Assumption GIV.2 reduces to Assumption 2SLS.2 from Chapter 5. 
Typically, GIV.2(a) holds when the instruments exclude exact linear dependencies. 
Assumption GIV.2(b) requires that, after the GLS-like transformation, there are
enough instruments partially correlated with the endogenous elements of $X_i$.

When $\Omega$ in (8.47) is replaced with a consistent estimator, $\hat\Omega$, we obtain the
generalized instrumental variables (GIV) estimator. Typically, $\hat\Omega$ would be chosen as in
equation (8.38) after an initial S2SLS estimation. We have argued that the estimator
is consistent under Assumptions GIV.1 and GIV.2; the details of replacing $\Omega$ with $\hat\Omega$
are very similar to the FGLS case covered in Chapter 7.
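
As an illustration of the construction just described, the sketch below computes the GIV estimator in two ways: directly from formula (8.47), and by applying system 2SLS to data transformed by the inverse square root of the estimated variance matrix. The function names, the array layout, and the simulated design are assumptions made for this example only; the initial estimate uses the fact that (8.47) with an identity matrix in place of $\Omega$ is system 2SLS.

```python
import numpy as np

def inv_sqrt(omega):
    """Symmetric inverse square root of a positive definite matrix."""
    vals, vecs = np.linalg.eigh(omega)
    return (vecs / np.sqrt(vals)) @ vecs.T

def giv_formula(y, X, Z, omega):
    """GIV estimator written as in equation (8.47)."""
    oi = np.linalg.inv(omega)
    XZ = np.einsum('igk,gh,ihl->kl', X, oi, Z)    # sum_i X_i' Omega^{-1} Z_i
    ZZ = np.einsum('igl,gh,ihm->lm', Z, oi, Z)    # sum_i Z_i' Omega^{-1} Z_i
    Zy = np.einsum('igl,gh,ih->l', Z, oi, y)      # sum_i Z_i' Omega^{-1} y_i
    A = XZ @ np.linalg.solve(ZZ, XZ.T)
    return np.linalg.solve(A, XZ @ np.linalg.solve(ZZ, Zy))

def giv_transform(y, X, Z, omega):
    """Same estimator via system 2SLS on the transformed equation (8.46)."""
    r = inv_sqrt(omega)
    ys = np.einsum('gh,ih->ig', r, y)             # Omega^{-1/2} y_i
    Xs = np.einsum('gh,ihk->igk', r, X)           # Omega^{-1/2} X_i
    Zs = np.einsum('gh,ihl->igl', r, Z)           # Omega^{-1/2} Z_i
    ZX = np.einsum('igl,igk->lk', Zs, Xs)
    ZZ = np.einsum('igl,igm->lm', Zs, Zs)
    Zy = np.einsum('igl,ig->l', Zs, ys)
    A = ZX.T @ np.linalg.solve(ZZ, ZX)
    return np.linalg.solve(A, ZX.T @ np.linalg.solve(ZZ, Zy))

# Hypothetical two-equation design in which Assumption GIV.1 holds.
rng = np.random.default_rng(0)
N, G = 4000, 2
w = rng.normal(size=(N, 3))
e = rng.normal(size=(N, G))
x = w.sum(axis=1, keepdims=True) + e + rng.normal(size=(N, G))
X = np.zeros((N, G, 4)); Z = np.zeros((N, G, 6))
X[:, 0, 0] = 1.0; X[:, 0, 1] = x[:, 0]
X[:, 1, 2] = 1.0; X[:, 1, 3] = x[:, 1]
Z[:, 0, 0] = 1.0; Z[:, 0, 1:3] = w[:, :2]
Z[:, 1, 3] = 1.0; Z[:, 1, 4:6] = w[:, 1:]
beta_true = np.array([1.0, 2.0, -1.0, 0.5])
y = np.einsum('igk,k->ig', X, beta_true) + e + 0.5 * e[:, ::-1]

# Initial S2SLS residuals give Omega_hat.
resid = y - np.einsum('igk,k->ig', X, giv_formula(y, X, Z, np.eye(G)))
omega_hat = resid.T @ resid / N

b_formula = giv_formula(y, X, Z, omega_hat)
b_transform = giv_transform(y, X, Z, omega_hat)
```

The two routes agree to numerical precision, which is the sense in which GIV is system 2SLS applied to (8.46) with transformed instruments.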

Implicit in GIV estimation is the first-stage regression $\hat\Omega^{-1/2}X_i$ on $\hat\Omega^{-1/2}Z_i$, which
yields fitted values $\hat\Omega^{-1/2}Z_i\hat\Pi^*$, where $\hat\Pi^* = [Z'(I_N \otimes \hat\Omega^{-1})Z]^{-1}Z'(I_N \otimes \hat\Omega^{-1})X$. Of
course, the first-stage regression need not be carried out to compute the GIV
estimator, but, in comparing GIV with other estimators, it can be useful to think of GIV
as system IV estimation of $\hat\Omega^{-1/2}y_i = \hat\Omega^{-1/2}X_i\beta + \hat\Omega^{-1/2}u_i$ using IVs $\hat\Omega^{-1/2}Z_i\hat\Pi^*$.

In Chapter 7, we noted that GLS when $\Omega$ is not a diagonal matrix must be used
with caution. Assumption GIV.1 can impose unintended restrictions on the
relationships between instruments and errors across equations, or time, or both. Users need
to keep in mind that the GIV estimator generally requires stronger assumptions than
$E(Z_i'u_i) = 0$ for consistency.

As with FGLS, there is a “standard” assumption under which the GIV estimator 
has a simplified asymptotic variance. The assumption is an extension of Assumption 
SGLS.3 in Chapter 7. 


ASSUMPTION GIV.3: $E(Z_i'\Omega^{-1}u_iu_i'\Omega^{-1}Z_i) = E(Z_i'\Omega^{-1}Z_i)$.


A sufficient condition for GIV.3 is (8.41), the same assumption that is sufficient for 
the GMM 3SLS estimator to use the optimal weighting matrix. Under Assumptions 
GIV.1, GIV.2, and GIV.3, it is easy to show 


$\mathrm{Avar}[\sqrt{N}(\hat\beta_{GIV} - \beta)] = \{E(X_i'\Omega^{-1}Z_i)[E(Z_i'\Omega^{-1}Z_i)]^{-1}E(Z_i'\Omega^{-1}X_i)\}^{-1},$ (8.49)


and this matrix is easily estimated by the usual process of replacing expectations with
sample averages and $\Omega$ with $\hat\Omega$.


8.4.2 Comparison of Generalized Method of Moment, Generalized Instrumental 
Variables, and the Traditional Three-Stage Least Squares Estimator 


Especially in estimating standard simultaneous equations models, a different kind of 
GLS transformation is often used. Rather than use an FGLS estimator in the first 
stage, as is implicit in GIV, the traditional 3SLS estimator is typically motivated as 
follows. The first stage uses untransformed $X_i$ and $Z_i$, giving fitted values $\hat X_i = Z_i\hat\Pi$,
where $\hat\Pi = (Z'Z)^{-1}Z'X$ is the matrix of first-stage regression coefficients. Then,
equation (8.46) is estimated by system IV, but with instruments $\Omega^{-1/2}\hat X_i$. When we
replace $\Omega$ with its estimate, we arrive at the estimator


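$\hat\beta_{T3SLS} = \left(\sum_{i=1}^N \hat X_i'\hat\Omega^{-1}X_i\right)^{-1}\left(\sum_{i=1}^N \hat X_i'\hat\Omega^{-1}y_i\right),$ (8.50)

where the subscript “T3SLS” denotes “traditional” 3SLS. Substituting $y_i = X_i\beta + u_i$
and rearranging shows that the orthogonality condition needed for consistency is


$E[(Z_i\Pi)'\Omega^{-1}u_i] = \Pi'E(Z_i'\Omega^{-1}u_i) = 0,$ (8.51)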


where $\Pi = \mathrm{plim}(\hat\Pi)$. Because of the presence of $\Pi'$, condition (8.51) is not quite the
same as Assumption GIV.1, but it would be a fluke if (8.51) held but GIV.1 did not.
Like the GIV estimator, the consistency of the traditional 3SLS estimator does not
follow from $E(Z_i'u_i) = 0$.

We have now discussed three different estimators of a system of equations based
on estimating the variance matrix $\Omega = E(u_iu_i')$. Why have seemingly different
estimators been given the label “three-stage least squares”? The answer is simple: In
the setting in which the traditional 3SLS estimator was proposed (system (8.12), with the
same instruments used in every equation), all estimates are identical. In fact, the
equivalence of all estimates holds if we just impose the common instrument
assumption. Let $w_i$ denote a vector assumed to be exogenous in every equation in the sense
that $E(w_i'u_{ig}) = 0$ for $g = 1, \ldots, G$. In other words, any variable exogenous in one
equation is exogenous in all equations. For a system such as (8.12), this means that
the same instruments can be used for every equation. In a panel data setting, it means
the chosen instruments are strictly exogenous. In either case, it makes sense to choose
the instrument matrix as


$Z_i = I_G \otimes w_i,$ (8.52)


which is a special case of (8.15). It follows from the results of Im, Ahn, Schmidt, and
Wooldridge (1999) that the GMM 3SLS estimator and the GIV estimator (using the
same $\hat\Omega$) are identical. Further, it can be shown that the GIV estimator and the
traditional 3SLS estimator are identical. (This follows because the first-stage regressions
involve the same set of explanatory variables, $w_i$, and so it does not matter whether
the matrix of first-stage regression coefficients is obtained via FGLS, which is what
GIV does, or system OLS, which is what T3SLS does.)
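
The equivalence under the common instrument choice (8.52) is algebraic, so it can be checked numerically. The following sketch, which assumes a simulated two-equation design and an arbitrary positive definite matrix standing in for the common $\hat\Omega$, computes the GMM 3SLS, GIV, and traditional 3SLS estimates and compares them. All variable names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(1)
N, G, M = 500, 2, 4
w = np.column_stack([np.ones(N), rng.normal(size=(N, M - 1))])    # common instruments w_i
Z = np.einsum('gh,im->ighm', np.eye(G), w).reshape(N, G, G * M)   # Z_i = I_G kron w_i
e = rng.normal(size=(N, G))
x = w[:, 1:].sum(axis=1, keepdims=True) + e + rng.normal(size=(N, G))
X = np.zeros((N, G, 4))
X[:, 0, 0] = 1.0; X[:, 0, 1] = x[:, 0]
X[:, 1, 2] = 1.0; X[:, 1, 3] = x[:, 1]
y = np.einsum('igk,k->ig', X, np.array([1.0, 2.0, -1.0, 0.5])) + e + 0.5 * e[:, ::-1]

omega = np.array([[1.0, 0.3], [0.3, 2.0]])    # stand-in for a common Omega_hat
oi = np.linalg.inv(omega)

ZX = np.einsum('igl,igk->lk', Z, X)
Zy = np.einsum('igl,ig->l', Z, y)
ZZ = np.einsum('igl,igm->lm', Z, Z)

# GMM 3SLS: weighting matrix (sum_i Z_i' Omega_hat Z_i)^{-1}
ZOZ = np.einsum('igl,gh,ihm->lm', Z, omega, Z)
b_gmm = np.linalg.solve(ZX.T @ np.linalg.solve(ZOZ, ZX),
                        ZX.T @ np.linalg.solve(ZOZ, Zy))

# GIV, equation (8.47)
ZoX = np.einsum('igl,gh,ihk->lk', Z, oi, X)
ZoZ = np.einsum('igl,gh,ihm->lm', Z, oi, Z)
Zoy = np.einsum('igl,gh,ih->l', Z, oi, y)
b_giv = np.linalg.solve(ZoX.T @ np.linalg.solve(ZoZ, ZoX),
                        ZoX.T @ np.linalg.solve(ZoZ, Zoy))

# Traditional 3SLS, equation (8.50): fitted values from the untransformed first stage
Xhat = np.einsum('igl,lk->igk', Z, np.linalg.solve(ZZ, ZX))
XoX = np.einsum('igk,gh,ihj->kj', Xhat, oi, X)
Xoy = np.einsum('igk,gh,ih->k', Xhat, oi, y)
b_t3sls = np.linalg.solve(XoX, Xoy)
```

With equation-specific instruments in place of (8.52), the three computations would generally give different numbers.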

In many modern applications of system IV methods to both simultaneous
equations and panel data, instruments that are exogenous in one equation are not
exogenous in all other equations. In such cases it is important to use the GMM 3SLS
estimator once $Z_i$ has been properly chosen. Of course, the minimum chi-square
estimator that does not impose SIV.5 is always available, too. The GIV estimator and
the traditional 3SLS estimator generally induce correlation between the transformed 
instruments and the structural errors. For this reason, we will tend to focus on GMM 
methods based on the original orthogonality conditions. Nevertheless, particularly in 
Chapter 11, we will see that the GIV approach can provide insights into the workings 
of certain panel data estimators while also affording computational simplicity. 


8.5 Testing Using Generalized Method of Moments 


8.5.1 Testing Classical Hypotheses 


Testing hypotheses after GMM estimation is straightforward. Let $\hat\beta$ denote a GMM
estimator, and let $\hat V$ denote its estimated asymptotic variance. Although the following
analysis can be made more general, in most applications we use an optimal GMM
estimator. Without Assumption SIV.5, the weighting matrix would be expression
(8.36) and $\hat V$ would be as in expression (8.37). This can be used for computing t
statistics by obtaining the asymptotic standard errors (square roots of the diagonal
elements of $\hat V$). Wald statistics of linear hypotheses of the form $H_0: R\beta = r$, where $R$
is a $Q \times K$ matrix with rank $Q$, are obtained using the same statistic we have already
seen several times. Under Assumption SIV.5 we can use the 3SLS estimator and its
asymptotic variance estimate in equation (8.45). For testing general system
hypotheses, we would probably not use the 2SLS estimator, because its asymptotic variance
is more complicated unless we make very restrictive assumptions.
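
For reference, the Wald statistic for linear restrictions takes the familiar quadratic form in $R\hat\beta - r$ weighted by the inverse of $R\hat VR'$. A minimal sketch follows; the numerical values of `beta_hat` and `V_hat` are made up for illustration and do not come from any estimation in the text.

```python
import numpy as np

def wald_linear(beta_hat, V_hat, R, r):
    """Wald statistic for H0: R beta = r, with V_hat the estimated Avar(beta_hat)."""
    diff = R @ beta_hat - r
    stat = float(diff @ np.linalg.solve(R @ V_hat @ R.T, diff))
    return stat, R.shape[0]              # statistic and degrees of freedom Q

# Hypothetical estimates, for illustration only.
beta_hat = np.array([1.0, 2.0, -1.0])
V_hat = np.diag([0.04, 0.09, 0.01])
t_stats = beta_hat / np.sqrt(np.diag(V_hat))     # t statistics for H0: beta_j = 0

R = np.array([[1.0, 0.0, 0.0],
              [0.0, 1.0, 1.0]])
r = np.array([0.6, 1.0])
W, df = wald_linear(beta_hat, V_hat, R, r)
```

Under the null hypothesis the statistic is compared with a chi-square distribution with `df` degrees of freedom.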

An alternative method for testing linear restrictions uses a statistic based on the
difference in the GMM objective function with and without the restrictions imposed. To
apply this statistic, we must assume that the GMM estimator uses the optimal weighting
matrix, so that $\hat W$ consistently estimates $[\mathrm{Var}(Z_i'u_i)]^{-1}$. Then, from Lemma 3.8,


$\left(N^{-1/2}\sum_{i=1}^N Z_i'u_i\right)'\hat W\left(N^{-1/2}\sum_{i=1}^N Z_i'u_i\right) \overset{a}{\sim} \chi^2_L$ (8.53)


since $Z_i'u_i$ is an $L \times 1$ vector with zero mean and variance $\Lambda$. If $\hat W$ does not
consistently estimate $[\mathrm{Var}(Z_i'u_i)]^{-1}$, then result (8.53) is false, and the following method
does not produce an asymptotically chi-square statistic.

Let $\hat\beta$ again be the GMM estimator, using optimal weighting matrix $\hat W$, obtained
without imposing the restrictions. Let $\tilde\beta$ be the GMM estimator using the same
weighting matrix $\hat W$ but obtained with the $Q$ linear restrictions imposed. The restricted
estimator can always be obtained by estimating a linear model with $K - Q$ rather
than $K$ parameters. Define the unrestricted and restricted residuals as $\hat u_i \equiv y_i - X_i\hat\beta$
and $\tilde u_i \equiv y_i - X_i\tilde\beta$, respectively. It can be shown that, under $H_0$, the GMM distance
statistic has a limiting chi-square distribution:


$\left[\left(\sum_{i=1}^N Z_i'\tilde u_i\right)'\hat W\left(\sum_{i=1}^N Z_i'\tilde u_i\right) - \left(\sum_{i=1}^N Z_i'\hat u_i\right)'\hat W\left(\sum_{i=1}^N Z_i'\hat u_i\right)\right]\Big/N \overset{a}{\sim} \chi^2_Q.$ (8.54)


See, for example, Hansen (1982) and Gallant (1987). The GMM distance statistic is
simply the difference in the criterion function (8.27) evaluated at the restricted and
unrestricted estimates, divided by the sample size, $N$. For this reason, expression
(8.54) is called a criterion function statistic. Because constrained minimization cannot
result in a smaller objective function than unconstrained minimization, expression
(8.54) is always nonnegative and usually strictly positive.

Under Assumption SIV.5 we can use the 3SLS estimator, in which case expression
(8.54) becomes


$\left(\sum_{i=1}^N Z_i'\tilde u_i\right)'\left(\sum_{i=1}^N Z_i'\hat\Omega Z_i\right)^{-1}\left(\sum_{i=1}^N Z_i'\tilde u_i\right) - \left(\sum_{i=1}^N Z_i'\hat u_i\right)'\left(\sum_{i=1}^N Z_i'\hat\Omega Z_i\right)^{-1}\left(\sum_{i=1}^N Z_i'\hat u_i\right),$ (8.55)


where $\hat\Omega$ would probably be computed using the 2SLS residuals from estimating the
unrestricted model. The division by $N$ has disappeared because of the definition of
$\hat W$; see equation (8.39).
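
The distance statistic can be sketched in a few lines for the single-equation case ($G = 1$), which keeps the arrays two-dimensional. The example below is an illustration under stated assumptions: the data are simulated, the weighting matrix has the form of (8.36), and the restricted estimate is obtained from a closed-form expression for minimizing a quadratic objective subject to $R\beta = r$, which is an alternative to reparameterizing the model with $K - Q$ parameters. For linear restrictions and a common weighting matrix, the distance statistic coincides numerically with the Wald statistic, which the last lines compute for comparison.

```python
import numpy as np

rng = np.random.default_rng(2)
N = 1000
w = np.column_stack([np.ones(N), rng.normal(size=(N, 3))])   # instruments, L = 4
e = rng.normal(size=N)
x = w[:, 1:].sum(axis=1) + e + rng.normal(size=N)            # endogenous regressor
X = np.column_stack([np.ones(N), x, w[:, 1]])                # K = 3
y = X @ np.array([1.0, 2.0, 0.0]) + e                        # H0 below is true

def gmm(W):
    """Unrestricted linear GMM estimate for weighting matrix W."""
    A = X.T @ w @ W @ w.T @ X
    return np.linalg.solve(A, X.T @ w @ W @ w.T @ y), A

def objective(b, W):
    g = w.T @ (y - X @ b)                # sum_i Z_i' u_i(b)
    return float(g @ W @ g)

# Weighting matrix of the form (8.36), built from first-step 2SLS residuals.
b0, _ = gmm(np.linalg.inv(w.T @ w / N))
u0 = y - X @ b0
W_opt = np.linalg.inv((w * u0[:, None] ** 2).T @ w / N)

b_unres, A = gmm(W_opt)

# Restricted estimate for H0: R beta = r using the same W_opt (closed form).
R = np.array([[0.0, 0.0, 1.0]])
r = np.array([0.0])
AinvRt = np.linalg.solve(A, R.T)
b_res = b_unres - AinvRt @ np.linalg.solve(R @ AinvRt, R @ b_unres - r)

distance = (objective(b_res, W_opt) - objective(b_unres, W_opt)) / N

V_hat = N * np.linalg.inv(A)             # estimated Avar(beta_hat) when W_opt is optimal
d = R @ b_unres - r
wald = float(d @ np.linalg.solve(R @ V_hat @ R.T, d))
```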

Testing nonlinear hypotheses is easy once the unrestricted estimator $\hat\beta$ has been
obtained. Write the null hypothesis as


$H_0: c(\beta) = 0,$ (8.56)


where $c(\beta) \equiv [c_1(\beta), c_2(\beta), \ldots, c_Q(\beta)]'$ is a $Q \times 1$ vector of functions. Let $C(\beta)$
denote the $Q \times K$ Jacobian of $c(\beta)$. Assuming that rank $C(\beta) = Q$, the Wald statistic is


$W = c(\hat\beta)'(\hat C\hat V\hat C')^{-1}c(\hat\beta),$ (8.57)


where $\hat C \equiv C(\hat\beta)$ is the Jacobian evaluated at the GMM estimate $\hat\beta$. Under $H_0$, the
Wald statistic has an asymptotic $\chi^2_Q$ distribution.
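
A small sketch of (8.57) follows, assuming the user supplies the restriction function and its analytical Jacobian. The restriction $\beta_1\beta_2 = 1$ and the numerical values of `beta_hat` and `V_hat` are hypothetical and serve only to show the mechanics.

```python
import numpy as np

def wald_nonlinear(beta_hat, V_hat, c, jac):
    """Wald statistic (8.57) for H0: c(beta) = 0, given the Jacobian function jac."""
    c_hat = np.atleast_1d(c(beta_hat))
    C_hat = np.atleast_2d(jac(beta_hat))                 # Q x K Jacobian at beta_hat
    stat = float(c_hat @ np.linalg.solve(C_hat @ V_hat @ C_hat.T, c_hat))
    return stat, c_hat.shape[0]

# Hypothetical estimates, for illustration only.
beta_hat = np.array([2.0, 1.0])
V_hat = np.diag([0.01, 0.04])

def c(b):
    return np.array([b[0] * b[1] - 1.0])                 # H0: beta_1 * beta_2 = 1

def jac(b):
    return np.array([[b[1], b[0]]])                      # derivative of c with respect to b

W, df = wald_nonlinear(beta_hat, V_hat, c, jac)
```

Here the Jacobian at the estimate is (1, 2), so the variance of the restriction is 0.01 + 4(0.04) = 0.17 and the statistic is 1/0.17.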


8.5.2 Testing Overidentification Restrictions


Just as in the case of single-equation analysis with more exogenous variables than 
explanatory variables, we can test whether overidentifying restrictions are valid in a 
system context. In the model (8.11) with instrument matrix $Z_i$, where $X_i$ is $G \times K$ and
$Z_i$ is $G \times L$, there are overidentifying restrictions if $L > K$. Assuming that $\hat W$ is an
optimal weighting matrix, it can be shown that


$\left(N^{-1/2}\sum_{i=1}^N Z_i'\hat u_i\right)'\hat W\left(N^{-1/2}\sum_{i=1}^N Z_i'\hat u_i\right) \overset{a}{\sim} \chi^2_{L-K}$ (8.58)


under the null hypothesis $H_0: E(Z_i'u_i) = 0$. The asymptotic $\chi^2_{L-K}$ distribution is
similar to result (8.53), but expression (8.53) contains the unobserved errors, $u_i$, whereas
expression (8.58) contains the residuals, $\hat u_i$. Replacing $u_i$ with $\hat u_i$ causes the degrees of
freedom to fall from $L$ to $L - K$: in effect, $K$ orthogonality conditions have been used
to compute $\hat\beta$, and $L - K$ are left over for testing.

The overidentification test statistic in expression (8.58) is just the objective function
(8.27) evaluated at the solution $\hat\beta$ and divided by $N$. It is because of expression (8.58)
that the GMM estimator using the optimal weighting matrix is called the minimum
chi-square estimator: $\hat\beta$ is chosen to make the minimum of the objective function have
an asymptotic chi-square distribution. If $\hat W$ is not optimal, expression (8.58) fails to
hold, making it much more difficult to test the overidentifying restrictions. When
$L = K$, the left-hand side of expression (8.58) is identically zero; there are no
overidentifying restrictions to be tested.
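
The statistic in (8.58) can be computed along the following lines. This is a sketch under assumptions: a two-step procedure with a first-step system 2SLS estimate, a weighting matrix of the form (8.36), and a simulated two-equation system with valid instruments. The function name and array layout are hypothetical. The just-identified case is included to show that the statistic collapses to zero when $L = K$.

```python
import numpy as np

def min_chi_square(y, X, Z):
    """Two-step minimum chi-square estimate and the overidentification statistic (8.58)."""
    N = y.shape[0]
    ZX = np.einsum('igl,igk->lk', Z, X)
    Zy = np.einsum('igl,ig->l', Z, y)

    def estimate(W):
        return np.linalg.solve(ZX.T @ W @ ZX, ZX.T @ W @ Zy)

    # Step 1: system 2SLS residuals.
    b1 = estimate(np.linalg.inv(np.einsum('igl,igm->lm', Z, Z) / N))
    Zu = np.einsum('igl,ig->il', Z, y - np.einsum('igk,k->ig', X, b1))   # rows (Z_i'u_i)'
    # Step 2: weighting matrix of the form (8.36) and the efficient estimate.
    W = np.linalg.inv(Zu.T @ Zu / N)
    beta = estimate(W)
    g = np.einsum('igl,ig->l', Z, y - np.einsum('igk,k->ig', X, beta)) / np.sqrt(N)
    return beta, float(g @ W @ g), Z.shape[2] - X.shape[2]

rng = np.random.default_rng(5)
N, G = 4000, 2
w = rng.normal(size=(N, 3))
e = rng.normal(size=(N, G))
x = w.sum(axis=1, keepdims=True) + e + rng.normal(size=(N, G))
X = np.zeros((N, G, 4)); Z = np.zeros((N, G, 6))
X[:, 0, 0] = 1.0; X[:, 0, 1] = x[:, 0]
X[:, 1, 2] = 1.0; X[:, 1, 3] = x[:, 1]
Z[:, 0, 0] = 1.0; Z[:, 0, 1:3] = w[:, :2]        # z_1 = (1, w_1, w_2)
Z[:, 1, 3] = 1.0; Z[:, 1, 4:6] = w[:, 1:]        # z_2 = (1, w_2, w_3)
y = np.einsum('igk,k->ig', X, np.array([1.0, 2.0, -1.0, 0.5])) + e + 0.5 * e[:, ::-1]

beta_over, J_over, df_over = min_chi_square(y, X, Z)         # L = 6, K = 4

Z_just = Z[:, :, [0, 1, 3, 4]]                               # L = K = 4
beta_just, J_just, df_just = min_chi_square(y, X, Z_just)
```

In the overidentified case, `J_over` would be compared with a chi-square distribution with `df_over` degrees of freedom.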

Under Assumption SIV.5, the 3SLS estimator is a minimum chi-square estimator, 
and the overidentification statistic in equation (8.58) can be written as 


$\left(\sum_{i=1}^N Z_i'\hat u_i\right)'\left(\sum_{i=1}^N Z_i'\hat\Omega Z_i\right)^{-1}\left(\sum_{i=1}^N Z_i'\hat u_i\right).$ (8.59)


Without Assumption SIV.5, the limiting distribution of this statistic is not chi square. 

In the case where the model has the form (8.12), overidentification test statistics 
can be used to choose between a systems and a single-equation method. For example, 
if the test statistic (8.59) rejects the overidentifying restrictions in the entire system, 
then the 3SLS estimators of the first equation are generally inconsistent. Assuming 
that the single-equation 2SLS estimation passes the overidentification test discussed 
in Chapter 6, 2SLS would be preferred. However, in making this judgment it is, as 
always, important to compare the magnitudes of the two sets of estimates in addition 
to the statistical significance of test statistics. Hausman (1983, p. 435) shows how to 
construct a statistic based directly on the 3SLS and 2SLS estimates of a particular
equation (assuming that 3SLS is asymptotically more efficient under the null), and 
this discussion can be extended to allow for the more general minimum chi-square 
estimator. 


8.6 More Efficient Estimation and Optimal Instruments 


In Section 8.3.3 we characterized the optimal weighting matrix given the matrix $Z_i$ of
instruments. But this discussion begs the question of how we can best choose $Z_i$. In
this section we briefly discuss two efficiency results. The first has to do with adding
valid instruments.

To be precise, let $Z_{i1}$ be a $G \times L_1$ submatrix of the $G \times L$ matrix $Z_i$, where $Z_i$
satisfies Assumptions SIV.1 and SIV.2. We also assume that $Z_{i1}$ satisfies Assumption
SIV.2; that is, $E(Z_{i1}'X_i)$ has rank $K$. This assumption ensures that $\beta$ is identified using
the smaller set of instruments. (Necessary is $L_1 \geq K$.) Given $Z_{i1}$, we know that the
efficient GMM estimator uses a weighting matrix that is consistent for $\Lambda_1^{-1}$, where
$\Lambda_1 = E(Z_{i1}'u_iu_i'Z_{i1})$. When we use the full set of instruments $Z_i = (Z_{i1}, Z_{i2})$, the
optimal weighting matrix is a consistent estimator of $\Lambda^{-1}$, with $\Lambda$ given in expression (8.30). Can
we say that using the full set of instruments (with the optimal weighting matrix) is
better than using the reduced set of instruments (with the optimal weighting matrix)?
The answer is that, asymptotically, we can do no worse, and often we can do better,
using a larger set of valid instruments.

The proof that adding orthogonality conditions generally improves efficiency
proceeds by comparing the asymptotic variances of $\sqrt{N}(\tilde\beta - \beta)$ and $\sqrt{N}(\hat\beta - \beta)$, where
the former estimator uses the restricted set of IVs and the latter uses the full set.
Then


$\mathrm{Avar}\,\sqrt{N}(\tilde\beta - \beta) - \mathrm{Avar}\,\sqrt{N}(\hat\beta - \beta) = (C_1'\Lambda_1^{-1}C_1)^{-1} - (C'\Lambda^{-1}C)^{-1},$ (8.60)


where $C_1 = E(Z_{i1}'X_i)$. The difference in equation (8.60) is positive semidefinite if and
only if $C'\Lambda^{-1}C - C_1'\Lambda_1^{-1}C_1$ is p.s.d. The latter result is shown by White (2001,
Proposition 4.51) using the formula for partitioned inverse; we will not reproduce it here.

The previous argument shows that we can never do worse asymptotically by add- 
ing instruments and computing the minimum chi-square estimator. But we need not 
always do better. The proof in White (2001) shows that the asymptotic variances of $\tilde\beta$
and $\hat\beta$ are identical if and only if


$C_2 = E(Z_{i2}'u_iu_i'Z_{i1})\Lambda_1^{-1}C_1,$ (8.61)

where $C_2 = E(Z_{i2}'X_i)$. Generally, this condition is difficult to check. However, if we
assume that $E(Z_i'u_iu_i'Z_i) = \sigma^2E(Z_i'Z_i)$ (the ideal assumption for system 2SLS), then
condition (8.61) becomes


$E(Z_{i2}'X_i) = E(Z_{i2}'Z_{i1})[E(Z_{i1}'Z_{i1})]^{-1}E(Z_{i1}'X_i).$

Straightforward algebra shows that this condition is equivalent to

$E[(Z_{i2} - Z_{i1}D_1)'X_i] = 0,$ (8.62)


where $D_1 = [E(Z_{i1}'Z_{i1})]^{-1}E(Z_{i1}'Z_{i2})$ is the $L_1 \times L_2$ matrix of coefficients from the
population regression of $Z_{i2}$ on $Z_{i1}$. Therefore, condition (8.62) has a simple
interpretation: $X_i$ is orthogonal to the part of $Z_{i2}$ that is left after netting out $Z_{i1}$. This
statement means that $Z_{i2}$ is not partially correlated with $X_i$, and so it is not useful as
instruments once $Z_{i1}$ has been included.

Condition (8.62) is very intuitive in the context of 2SLS estimation of a single 
equation. Under $E(u_i^2z_i'z_i) = \sigma^2E(z_i'z_i)$, 2SLS is the minimum chi-square estimator.
The elements of $z_{i1}$ would include all exogenous elements of $x_i$, and then some. If, say,
$x_{iK}$ is the only endogenous element of $x_i$, condition (8.62) becomes


$L(x_{iK} \mid z_{i1}, z_{i2}) = L(x_{iK} \mid z_{i1}),$ (8.63)


so that the linear projection of $x_{iK}$ onto $z_i$ depends only on $z_{i1}$. If you recall how the
IVs for 2SLS are obtained (by estimating the linear projection of $x_{iK}$ on $z_i$ in the first
stage), it makes perfectly good sense that $z_{i2}$ can be omitted under condition (8.63)
without affecting efficiency of 2SLS.

In the general case, if the error vector u; contains conditional heteroskedasticity, or 
correlation across its elements (conditional or otherwise), condition (8.61) is unlikely 
to be true. As a result, we can keep improving asymptotic efficiency by adding 
more valid instruments. Whenever the error term satisfies a zero conditional mean 
assumption, unlimited IVs are available. For example, consider the linear model 
$E(y \mid x) = x\beta$, so that the error $u \equiv y - x\beta$ has a zero mean given $x$. The OLS
estimator is the IV estimator using IVs $z_1 = x$. The preceding efficiency result implies
that, if $\mathrm{Var}(u \mid x) \neq \mathrm{Var}(u)$, there are unlimited minimum chi-square estimators that
are asymptotically more efficient than OLS. Because E(u |x) = 0, h(x) is a valid set 
of IVs for any vector function h(-). (Assuming, as always, that the appropriate 
moments exist.) Then, the minimum chi-square estimate using IVs z = [x,h(x)] is 
generally more asymptotically efficient than OLS. (Chamberlain, 1982, and Cragg, 
1983, independently obtained this result.) If Var(y |x) is constant, adding functions 
of x to the IV list results in no asymptotic improvement because the linear projection 
of x onto x and h(x) obviously does not depend on h(x). 
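
The efficiency comparison in this paragraph can be illustrated with sample analogues. In the sketch below, which assumes a simulated heteroskedastic linear model and the arbitrary choice $h(x) = (x_1^2, x_1^3)$, the heteroskedasticity-robust variance of OLS is compared with the estimated variance of the minimum chi-square estimator that uses the expanded IV list. When both are built from the same residuals, the difference is positive semidefinite by the partitioned-inverse argument behind (8.60).

```python
import numpy as np

rng = np.random.default_rng(3)
N = 2000
x1 = rng.normal(size=N)
X = np.column_stack([np.ones(N), x1])                  # x = (1, x1)
u = rng.normal(size=N) * np.sqrt(0.5 + x1 ** 2)        # Var(u | x) depends on x
y = X @ np.array([1.0, 2.0]) + u

Z = np.column_stack([X, x1 ** 2, x1 ** 3])             # z = [x, h(x)], hypothetical h

b_ols = np.linalg.solve(X.T @ X, X.T @ y)
u_hat = y - X @ b_ols

def efficient_avar(Zmat):
    """(C' Lambda^{-1} C)^{-1} with sample analogues built from u_hat."""
    C = Zmat.T @ X
    Lam = (Zmat * u_hat[:, None] ** 2).T @ Zmat
    return np.linalg.inv(C.T @ np.linalg.solve(Lam, C))

V_ols = efficient_avar(X)      # robust variance of OLS (the case z_1 = x)
V_gmm = efficient_avar(Z)      # minimum chi-square with the expanded IV list

# Minimum chi-square estimate using the expanded instrument list.
W = np.linalg.inv((Z * u_hat[:, None] ** 2).T @ Z)
b_gmm = np.linalg.solve(X.T @ Z @ W @ Z.T @ X, X.T @ Z @ W @ Z.T @ y)

gap_eigs = np.linalg.eigvalsh(V_ols - V_gmm)           # should be nonnegative
```

The finite-sample caveats discussed next apply: a smaller estimated variance does not guarantee better small-sample behavior.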

Under homoskedasticity, adding moment conditions does not reduce the
asymptotic efficiency of the minimum chi-square estimator. Therefore, it may seem that,
when we have a linear model that represents a conditional expectation, we cannot
lose by adding IVs and performing minimum chi-square. (Plus, we can then test the
functional form $E(y \mid x) = x\beta$ by testing the overidentifying restrictions.)
Unfortunately, as shown by several authors, including Tauchen (1986), Altonji and Segal
(1996), and Ziliak (1997), GMM estimators that use many overidentifying
restrictions can have very poor finite sample properties.

The previous discussion raises the following possibility: rather than adding more 
and more orthogonality conditions to improve on inefficient estimators, can we find a 
small set of optimal IVs? The answer is yes, provided we replace Assumption SIV.1 
with a zero conditional mean assumption. 


ASSUMPTION SIV.1′:  $E(u_{ig} \mid w_i) = 0$, $g = 1, \ldots, G$ for some vector $w_i$.


Assumption SIV.1′ implies that $w_i$ is exogenous in every equation, and each element
of the instrument matrix $Z_i$ can be any function of $w_i$.


THEOREM 8.5 (Optimal Instruments): Under Assumption SIV.1′ (and sufficient
regularity conditions), the optimal choice of instruments is $Z_i^* = \Omega(w_i)^{-1}E(X_i \mid w_i)$,
where $\Omega(w_i) \equiv E(u_iu_i' \mid w_i)$, provided that rank $E(Z_i^{*\prime}X_i) = K$.


We will not prove Theorem 8.5 here. We discuss a more general case in Section 14.5; 
see also Newey and McFadden (1994, Section 5.4). Theorem 8.5 implies that, if the
$G \times K$ matrix $Z_i^*$ were available, we would use it in equation (8.26) in place of $Z_i$ to
obtain the SIV estimator with the smallest asymptotic variance. This would take the
arbitrariness out of choosing additional functions of $z_i$ to add to the IV list: once we
have $Z_i^*$, all other functions of $w_i$ are redundant.

Theorem 8.5 implies that if Assumption SIV.1′, the system homoskedasticity
assumption (8.41), and $E(X_i \mid w_i) = (I_G \otimes w_i)\Pi \equiv Z_i\Pi$ hold, then the optimal
instruments are simply $Z_i^* = \Omega^{-1}(Z_i\Pi)$. But this choice of instruments leads directly to the
traditional 3SLS estimator in equation (8.50) when $\Omega$ and $\Pi$ are replaced by their
$\sqrt{N}$-consistent estimators. As we discussed in Section 8.4.2, this estimator is identical
to the GIV estimator and GMM 3SLS estimator.

If $E(u_i \mid X_i) = 0$ and $E(u_iu_i' \mid X_i) = \Omega$, then the optimal instruments are $\Omega^{-1}X_i$,
which gives the GLS estimator. Replacing $\Omega$ by $\hat\Omega$ has no effect asymptotically, and
so the FGLS is the SIV estimator with optimal choice of instruments.

Without further assumptions, both $\Omega(w_i)$ and $E(X_i \mid w_i)$ can be arbitrary functions
of $w_i$, in which case the optimal SIV estimator is not easily obtainable. It is possible
to find an estimator that is asymptotically efficient using nonparametric estimation
methods to estimate $\Omega(w_i)$ and $E(X_i \mid w_i)$, but there are many practical hurdles to
overcome in applying such procedures. See Newey (1990) for an approach that
approximates $E(X_i \mid w_i)$ by parametric functional forms, where the approximation
gets better as the sample size grows.


8.7 Summary Comments on Choosing an Estimator 


Throughout this chapter we have commented on the robustness and efficiency of 
different estimators. It is useful here to summarize the considerations behind choos- 
ing among estimators for systems or panel data applications. 

Generally, if we have started with moment conditions of the form $E(Z_i'u_i) = 0$,
GMM estimation based on this set of moment conditions will be more robust than 
estimators based on a transformed set of moment conditions, such as GIV. If we 
decide to use GMM, we can use the unrestricted weighting matrix, as in (8.36), or we 
might use the GMM 3SLS estimator, which uses weighting matrix (8.39). Under 
Assumption SIV.5, which is a system homoskedasticity assumption, the 3SLS esti- 
mator is asymptotically efficient. In an important special case, where the instruments 
can be chosen as in (8.52), GMM 3SLS, GIV, and traditional 3SLS are identical. 
When GMM and GIV are both consistent but are not $\sqrt{N}$-asymptotically equivalent,
they cannot generally be ranked in terms of asymptotic efficiency. 

One of the efficiency results of the previous section is that one can never do worse 
by adding instruments and using the efficient weighting matrix in GMM. This has 
implications for panel data applications. For example, if one has the option of 
choosing the instruments as in equation (8.22) or equation (8.15) (with G = T), the 
efficient GMM estimator using equation (8.15) is no less efficient, asymptotically, 
than the efficient GMM estimator using equation (8.22). This follows because we can 
obtain (8.22) as a linear combination of (8.15), and using a linear combination is 
operationally the same as using a restricted set of instruments. 

What about the choice between the S2SLS and the GMM 3SLS estimators? Under 
the assumptions of Theorem 8.4, GMM 3SLS is asymptotically no less efficient than 
S2SLS. Nevertheless, it is useful to know that there are situations where S2SLS and 
GMM 3SLS coincide. 

The first is easy: when the general system (8.11) is just identified, that is, L = K, all 
estimators reduce to the IV estimator in equation (8.26). In the case of the SUR
system (8.12), the system is just identified if and only if each equation is just identified:
$L_g = K_g$, $g = 1, \ldots, G$, and the rank condition holds for each equation.

When estimating system (8.12), there is another case where S2SLS—which, recall, 
reduces to 2SLS estimation of each equation—coincides with 3SLS, regardless of the 
degree of overidentification. The 3SLS estimator is equivalent to 2SLS equation by
equation when $\hat\Omega$ is a diagonal matrix, that is, $\hat\Omega = \mathrm{diag}(\hat\sigma_1^2, \hat\sigma_2^2, \ldots, \hat\sigma_G^2)$; see Problem
8.7.

The algebraic equivalence of system 2SLS and 3SLS for estimating (8.12) when $\hat\Omega$
is a diagonal matrix allows us to conclude that S2SLS and 3SLS are asymptotically
equivalent when $\Omega$ is diagonal. The reason is simple. If we could use $\Omega$ in the 3SLS
estimator, then it would be identical to 2SLS equation by equation. The actual 3SLS
estimator, which uses $\hat\Omega$, is $\sqrt{N}$-equivalent to the hypothetical 3SLS estimator that
uses $\Omega$. Therefore, 3SLS and 2SLS are $\sqrt{N}$-equivalent.
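
The algebraic part of this argument is easy to confirm numerically. The sketch below assumes a simulated two-equation system of the form (8.12) with equation-specific instruments and plugs an arbitrary diagonal matrix into the 3SLS weighting matrix; the resulting estimate matches 2SLS applied equation by equation. The names and the design are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(4)
N = 500
w = rng.normal(size=(N, 3))
e = rng.normal(size=(N, 2))
x = w.sum(axis=1, keepdims=True) + e + rng.normal(size=(N, 2))
y = np.column_stack([1.0 + 2.0 * x[:, 0], -1.0 + 0.5 * x[:, 1]]) + e

ones = np.ones(N)
x_g = [np.column_stack([ones, x[:, 0]]), np.column_stack([ones, x[:, 1]])]
z_g = [np.column_stack([ones, w[:, :2]]), np.column_stack([ones, w[:, 1:]])]

def tsls(yg, xg, zg):
    """Single-equation 2SLS."""
    xhat = zg @ np.linalg.solve(zg.T @ zg, zg.T @ xg)
    return np.linalg.solve(xhat.T @ xg, xhat.T @ yg)

b_2sls = np.concatenate([tsls(y[:, g], x_g[g], z_g[g]) for g in range(2)])

# Stack into the system form (8.12): block-diagonal X_i and Z_i.
X = np.zeros((N, 2, 4)); Z = np.zeros((N, 2, 6))
X[:, 0, :2] = x_g[0]; X[:, 1, 2:] = x_g[1]
Z[:, 0, :3] = z_g[0]; Z[:, 1, 3:] = z_g[1]

omega_diag = np.diag([1.3, 0.7])       # any diagonal matrix with positive entries
ZX = np.einsum('igl,igk->lk', Z, X)
Zy = np.einsum('igl,ig->l', Z, y)
ZOZ = np.einsum('igl,gh,ihm->lm', Z, omega_diag, Z)
b_3sls = np.linalg.solve(ZX.T @ np.linalg.solve(ZOZ, ZX),
                         ZX.T @ np.linalg.solve(ZOZ, Zy))
```

Replacing `omega_diag` with a matrix that has nonzero off-diagonal elements would generally break the equality when the equations are overidentified.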

Even in cases where 2SLS on each equation is not algebraically or asymptotically 
equivalent to 3SLS, it is not necessarily true that we should prefer the 3SLS estimator 
(or the minimum chi-square estimator more generally). Why? Suppose primary
interest lies in estimating the parameters of the first equation, $\beta_1$. On the one hand,
we know that 2SLS estimation of this equation produces consistent estimators under
the orthogonality conditions $E(z_1'u_1) = 0$ and the condition rank $E(z_1'x_1) = K_1$. For
consistency, we do not care what is happening elsewhere in the system as long as
these two assumptions hold. On the other hand, the GMM 3SLS and minimum
chi-square estimators of $\beta_1$ are generally inconsistent unless $E(z_g'u_g) = 0$ for $g = 1, \ldots, G$.
(But we do not need to assume $E(z_g'u_h) = 0$ for $g \neq h$ as we would to apply GIV.)
Therefore, in using GMM to consistently estimate $\beta_1$, all equations in the system
must be properly specified, which means that the instruments must be exogenous in 
their corresponding equations. Such is the nature of system estimation procedures. As 
with system OLS and FGLS, there is a trade-off between robustness and efficiency. 


Problems 


8.1. a. Show that the GMM estimator that solves the problem (8.27) satisfies the
first-order condition


$\left(\sum_{i=1}^N Z_i'X_i\right)'\hat W\left(\sum_{i=1}^N Z_i'(y_i - X_i\hat\beta)\right) = 0.$

b. Use this expression to obtain formula (8.28).


8.2. Consider the system of equations

$y_i = X_i\beta + u_i,$


where $i$ indexes the cross section observation, $y_i$ and $u_i$ are $G \times 1$, $X_i$ is $G \times K$, $Z_i$ is the
$G \times L$ matrix of instruments, and $\beta$ is $K \times 1$. Let $\Omega = E(u_iu_i')$. Make the following
four assumptions: (1) $E(Z_i'u_i) = 0$; (2) rank $E(Z_i'X_i) = K$; (3) $E(Z_i'Z_i)$ is nonsingular;
and (4) $E(Z_i'\Omega Z_i)$ is nonsingular.

a. What are the asymptotic properties of the 3SLS estimator?

b. Find the asymptotic variance matrix of $\sqrt{N}(\hat\beta_{3SLS} - \beta)$.

c. How would you estimate Avar($\hat\beta_{3SLS}$)?


8.3. Let $x$ be a $1 \times K$ random vector and let $z$ be a $1 \times M$ random vector. Suppose
that $E(x \mid z) = L(x \mid z) = z\Pi$, where $\Pi$ is an $M \times K$ matrix; in other words, the
expectation of $x$ given $z$ is linear in $z$. Let $h(z)$ be any $1 \times Q$ nonlinear function of $z$, and
define an expanded instrument list as $w \equiv [z, h(z)]$.

a. Show that rank $E(z'x)$ = rank $E(w'x)$. (Hint: First show that rank $E(z'x)$ =
rank $E(z'x^*)$, where $x^*$ is the linear projection of $x$ onto $z$; the same holds with $z$
replaced by $w$. Next, show that when $E(x \mid z) = L(x \mid z)$, $L[x \mid z, h(z)] = L(x \mid z)$ for
any function $h(z)$ of $z$.)

b. Explain why the result from part a is important for identification with IV estima- 
tion. 


8.4. Consider the system of equations (8.12), and let $w$ be a row vector of variables
exogenous in every equation. Assume that the exogeneity assumption takes the
stronger form $E(u_g \mid w) = 0$, $g = 1, 2, \ldots, G$. This assumption means that $w$ and
nonlinear functions of $w$ are valid instruments in every equation.


a. Suppose that $E(x_g \mid w)$ is linear in $w$ for all $g$. Show that adding nonlinear
functions of $w$ to the instrument list cannot help in satisfying the rank condition. (Hint:
Apply Problem 8.3.)


b. What happens if $E(x_g \mid w)$ is a nonlinear function of $w$ for some $g$?


8.5. Verify that the difference $(C'\Lambda^{-1}C) - (C'WC)(C'W\Lambda WC)^{-1}(C'WC)$ in
expression (8.34) is positive semidefinite for any symmetric positive definite matrices $W$
and $\Lambda$. (Hint: Show that the difference can be expressed as

$$C'\Lambda^{-1/2}\left[I_L - D(D'D)^{-1}D'\right]\Lambda^{-1/2}C,$$

where $D \equiv \Lambda^{1/2}WC$. Then, note that for any $L \times K$ matrix $D$, $I_L - D(D'D)^{-1}D'$ is a
symmetric, idempotent matrix, and therefore positive semidefinite.)


8.6. Consider the system (8.12) in the $G = 2$ case, with an $i$ subscript added:

$$y_{i1} = x_{i1}\beta_1 + u_{i1},$$

$$y_{i2} = x_{i2}\beta_2 + u_{i2}.$$

The instrument matrix is

$$Z_i = \begin{pmatrix} z_{i1} & 0 \\ 0 & z_{i2} \end{pmatrix}.$$

Let $\Omega$ be the $2 \times 2$ variance matrix of $u_i \equiv (u_{i1}, u_{i2})'$, and write

$$\Omega^{-1} = \begin{pmatrix} \sigma^{11} & \sigma^{12} \\ \sigma^{12} & \sigma^{22} \end{pmatrix}.$$

a. Find $E(Z_i'\Omega^{-1}u_i)$ and show that it is not necessarily zero under the orthogonality
conditions $E(z_{i1}'u_{i1}) = 0$ and $E(z_{i2}'u_{i2}) = 0$.

b. What happens if $\Omega$ is diagonal (so that $\Omega^{-1}$ is diagonal)?

c. What if $z_{i1} = z_{i2}$ (without restrictions on $\Omega$)?


8.7. With definitions (8.14) and (8.15), show that system 2SLS and 3SLS are
numerically identical whenever $\hat{\Omega}$ is a diagonal matrix.


8.8. Consider the standard panel data model

$$y_{it} = x_{it}\beta + u_{it}, \tag{8.64}$$

where the $1 \times K$ vector $x_{it}$ might have some elements correlated with $u_{it}$. Let $z_{it}$ be a
$1 \times L$ vector of instruments, $L \geq K$, such that $E(z_{it}'u_{it}) = 0$, $t = 1, 2, \ldots, T$. (In practice,
$z_{it}$ would contain some elements of $x_{it}$, including a constant and possibly time
dummies.)

a. Write down the system 2SLS estimator if the instrument matrix is $Z_i =
(z_{i1}', z_{i2}', \ldots, z_{iT}')'$ (a $T \times L$ matrix). Show that this estimator is a pooled 2SLS
estimator. That is, it is the estimator obtained by 2SLS estimation of equation (8.64)
using instruments $z_{it}$, pooled across all $i$ and $t$.

b. What is the rank condition for the pooled 2SLS estimator? 

c. Without further assumptions, show how to estimate the asymptotic variance of the 
pooled 2SLS estimator. 

d. Without further assumptions, how would you estimate the optimal weighting 
matrix? Be very specific. 


e. Show that the assumptions

$$E(u_{it} \mid z_{it}, u_{i,t-1}, z_{i,t-1}, \ldots, u_{i1}, z_{i1}) = 0, \quad t = 1, \ldots, T \tag{8.65}$$

$$E(u_{it}^2 \mid z_{it}) = \sigma^2, \quad t = 1, \ldots, T \tag{8.66}$$

imply that the usual standard errors and test statistics reported from the pooled 2SLS
estimation are valid. These assumptions make implementing 2SLS for panel data
very simple.

f. What estimator would you use under condition (8.65) but where we relax
condition (8.66) to $E(u_{it}^2 \mid z_{it}) = E(u_{it}^2) \equiv \sigma_t^2$, $t = 1, \ldots, T$? This approach will involve an
initial pooled 2SLS estimation.


g. Suppose you choose the instrument matrix as $Z_i = \mathrm{diag}(z_{i1}, z_{i2}, \ldots, z_{iT})$. Show
that the system 2SLS estimator can be obtained as follows. (1) For each $t$, regress $x_{it}$
on $z_{it}$, $i = 1, \ldots, N$, and obtain the fitted values, $\hat{x}_{it}$. (2) Compute the pooled IV
estimator using $\hat{x}_{it}$ as IVs for $x_{it}$.


h. What is the most efficient way to use the moment conditions $E(z_{it}'u_{it}) = 0$,
$t = 1, \ldots, T$?


8.9. Consider the single-equation linear model from Chapter 5: $y = x\beta + u$.
Strengthen Assumption 2SLS.1 to $E(u \mid z) = 0$ and Assumption 2SLS.3 to $E(u^2 \mid z) =
\sigma^2$, and keep the rank condition 2SLS.2. Show that if $E(x \mid z) = z\Pi$ for some $L \times K$
matrix $\Pi$, the 2SLS estimator uses the optimal instruments based on the
orthogonality condition $E(u \mid z) = 0$. What does this result imply about OLS if $E(u \mid x) = 0$
and $\mathrm{Var}(u \mid x) = \sigma^2$?


8.10. In the model from Problem 8.8, let $\hat{u}_{it} \equiv y_{it} - x_{it}\hat{\beta}$ be the residuals after pooled
2SLS estimation.


a. Consider the following test for AR(1) serial correlation in $\{u_{it}: t = 1, \ldots, T\}$:
estimate the auxiliary equation

$$y_{it} = x_{it}\beta + \rho\hat{u}_{i,t-1} + error_{it}, \quad t = 2, \ldots, T; \; i = 1, \ldots, N,$$

by 2SLS using instruments $(z_{it}, \hat{u}_{i,t-1})$, and use the $t$ statistic on $\hat{\rho}$. Argue that, if we
strengthen (8.56) to $E(u_{it} \mid z_{it}, x_{i,t-1}, u_{i,t-1}, z_{i,t-1}, x_{i,t-2}, \ldots, x_{i1}, u_{i1}, z_{i1}) = 0$, then the
heteroskedasticity-robust $t$ statistic for $\hat{\rho}$ is asymptotically valid as a test for serial
correlation. (Hint: Under the dynamic completeness assumption (8.56), which is
effectively the null hypothesis, the fact that $\hat{u}_{i,t-1}$ is used in place of $u_{i,t-1}$ does not
affect the limiting distribution of $\hat{\rho}$; see Section 6.1.3.) What is the homoskedasticity
assumption that justifies the usual $t$ statistic?


b. What should be done to obtain a heteroskedasticity-robust test? 


8.11. a. Use Theorem 8.5 to show that, in the single-equation model

$$y_1 = z_1\delta_1 + \alpha_1 y_2 + u_1,$$

with $E(u_1 \mid z) = 0$, where $z_1$ is a strict subset of $z$, and $\mathrm{Var}(u_1 \mid z) = \sigma_1^2$, the optimal
instrumental variables are $[z_1, E(y_2 \mid z)]$.

b. If $y_2$ is a binary variable with $P(y_2 = 1 \mid z) = F(z)$ for some known function $F(\cdot)$,
$0 < F(z) < 1$, what are the optimal IVs?

8.12. Suppose in the system (8.11) we think $\Omega$ has a special form, and so we
estimate a restricted version of it. Let $\hat{\Lambda}$ be a $G \times G$ positive semidefinite matrix such
that plim($\hat{\Lambda}$) $= \Lambda \neq \Omega$.

a. If we apply GMM 3SLS using $\hat{\Lambda}$, is the resulting estimator generally consistent for
$\beta$? Explain.

b. If Assumption SIV.5 holds, but $\Lambda \neq \Omega$, can you use the difference in criterion
functions for testing?


c. Let $\hat{\Omega}$ be the unrestricted estimator given by (8.38), and suppose the restrictions
imposed in obtaining $\hat{\Lambda}$ hold: $\Lambda = \Omega$. Is there any loss in asymptotic efficiency in
using $\hat{\Omega}$ rather than $\hat{\Lambda}$ in a GMM 3SLS analysis? Explain.


8.13. Consider a model where exogenous variables interact with an endogenous
explanatory variable:

$$y_1 = \eta_1 + z_1\delta_1 + \alpha_1 y_2 + z_1 y_2\gamma_1 + u_1$$

$$E(u_1 \mid z) = 0,$$

where $z$ is the vector of all exogenous variables. Assume, in addition, (1)
$\mathrm{Var}(u_1 \mid z) = \sigma_1^2$ and (2) $E(y_2 \mid z) = z\pi_2$.

a. Apply Theorem 8.5 to obtain the optimal instrumental variables. 

b. How would you operationalize the optimal IV estimator? 

8.14. Write a model for panel data with potentially endogenous explanatory
variables as

$$y_{it1} = \eta_{t1} + z_{it1}\delta_1 + y_{it2}\alpha_1 + u_{it1}, \quad t = 1, \ldots, T,$$

where $\eta_{t1}$ denotes a set of intercepts for each $t$, $z_{it1}$ is $1 \times L_1$, and $y_{it2}$ is $1 \times G_1$. Let $z_{it}$
be the $1 \times L$ vector, $L = L_1 + L_2$, such that $E(z_{it}'u_{it1}) = 0$, $t = 1, \ldots, T$. We need
$L_2 \geq G_1$.

a. A reduced form for $y_{it2}$ can be written as $y_{it2} = z_{it}\Pi_2 + v_{it2}$. Explain how to use
pooled OLS to obtain a fully robust test of $H_0: E(y_{it2}'u_{it1}) = 0$. (The test should have
$G_1$ degrees of freedom.)

b. Extend the approach based on equation (6.32) to obtain a test of overidentifying
restrictions for the pooled 2SLS estimator. You should describe how to make the test
fully robust to serial correlation and heteroskedasticity.


8.15. Use the data in AIRFARE.RAW to answer this question.

a. Estimate the passenger demand model

$$\log(passen_{it}) = \beta_0 + \theta_1 y98_t + \theta_2 y99_t + \theta_3 y00_t + \beta_1\log(fare_{it}) + \beta_2\log(dist_i) + \beta_3[\log(dist_i)]^2 + u_{it1}$$

by pooled OLS. Obtain the usual standard error and the fully robust standard error
for $\hat{\beta}_{1,POLS}$. Interpret the coefficient.

b. Explicitly test $\{u_{it1}: t = 1, \ldots, 4\}$ for AR(1) serial correlation. What do you make
of the estimate of $\rho$?

c. The variable $concen_{it}$ can be used as an IV for $\log(fare_{it})$. Estimate the reduced
form for $\log(fare_{it})$ (by pooled OLS) and test whether $\log(fare_{it})$ and $concen_{it}$ are
sufficiently partially correlated. You should use a fully robust test.

d. Estimate the model from part a by pooled 2SLS. Is $\hat{\beta}_{1,P2SLS}$ practically different
from $\hat{\beta}_{1,POLS}$? How do the pooled and fully robust standard errors for $\hat{\beta}_{1,P2SLS}$
compare?

e. Under what assumptions can you directly compare $\hat{\beta}_{1,P2SLS}$ and $\hat{\beta}_{1,POLS}$ using the
usual standard errors to obtain a Hausman statistic? Is it likely these assumptions are
satisfied?

f. Use the test from Problem 8.14 to formally test that $\log(fare_{it})$ is exogenous in the
demand equation. Use the fully robust version.


9 Simultaneous Equations Models 


9.1 Scope of Simultaneous Equations Models 


The emphasis in this chapter is on situations where two or more variables are jointly 
determined by a system of equations. Nevertheless, the population model, the iden- 
tification analysis, and the estimation methods apply to a much broader range of 
problems. In Chapter 8, we saw that the omitted variables problem described in Ex- 
ample 8.2 has the same statistical structure as the true simultaneous equations model 
in Example 8.1. In fact, any or all of simultaneity, omitted variables, and measure- 
ment error can be present in a system of equations. Because the omitted variable and 
measurement error problems are conceptually easier—and it was for this reason that 
we discussed them in single-equation contexts in Chapters 4 and 5—our examples 
and discussion in this chapter are geared mostly toward true simultaneous equations 
models (SEMs). 

For effective application of true SEMs, we must understand the kinds of situations 
suitable for SEM analysis. The labor supply and wage offer example, Example 8.1, 
is a legitimate SEM application. The labor supply function describes individual be- 
havior, and it is derivable from basic economic principles of individual utility max- 
imization. Holding other factors fixed, the labor supply function gives the hours of 
labor supply at any potential wage facing the individual. The wage offer function 
describes firm behavior, and, like the labor supply function, the wage offer function is 
self-contained. 

When an equation in an SEM has economic meaning in isolation from the other 
equations in the system, we say that the equation is autonomous. One way to think 
about autonomy is in terms of counterfactual reasoning, as in Example 8.1. If we 
know the parameters of the labor supply function, then, for any individual, we can 
find labor hours given any value of the potential wage (and values of the other 
observed and unobserved factors affecting labor supply). In other words, we could, in 
principle, trace out the individual labor supply function for given levels of the other 
observed and unobserved variables. 

Causality is closely tied to the autonomy requirement. An equation in an SEM 
should represent a causal relationship; therefore, we should be interested in varying 
each of the explanatory variables—including any that are endogenous—while hold- 
ing all the others fixed. Put another way, each equation in an SEM should represent 
some underlying conditional expectation that has a causal structure. What compli- 
cates matters is that the conditional expectations are in terms of counterfactual vari- 
ables. In the labor supply example, if we could run a controlled experiment where we 
exogenously varied the wage offer across individuals, then the labor supply function 
could be estimated without ever considering the wage offer function. In fact, in the
absence of omitted variables or measurement error, ordinary least squares would be
an appropriate estimation method. 

Generally, supply and demand examples satisfy the autonomy requirement, re- 
gardless of the level of aggregation (individual, household, firm, city, and so on), and 
simultaneous equations systems were originally developed for such applications. (See, 
for example, Haavelmo (1943) and Kiefer’s (1989) interview of Arthur S. Goldberger.) 
Unfortunately, many recent applications of SEMs fail the autonomy requirement; as 
a result, it is difficult to interpret what has actually been estimated. Examples that fail 
the autonomy requirement often have the same feature: the endogenous variables in 
the system are all choice variables of the same economic unit. 

As an example, consider an individual’s choice of weekly hours spent in legal 
market activities and hours spent in criminal behavior. An economic model of crime 
can be derived from utility maximization; for simplicity, suppose the choice is only 
between hours working legally (work) and hours involved in crime (crime). The fac- 
tors assumed to be exogenous to the individual’s choice are things like wage in legal 
activities, other income sources, probability of arrest, expected punishment, and so 
on. The utility function can depend on education, work experience, gender, race, and 
other demographic variables. 

Two structural equations fall out of the individual’s optimization problem: one has 
work as a function of the exogenous factors, demographics, and unobservables; the 
other has crime as a function of these same factors. Of course, it is always possible 
that factors treated as exogenous by the individual cannot be treated as exogenous by 
the econometrician: unobservables that affect the choice of work and crime could 
be correlated with the observable factors. But this possibility is an omitted variables 
problem. (Measurement error could also be an important issue in this example.) 
Whether or not omitted variables or measurement error are problems, each equation 
has a causal interpretation. 

In the crime example, and many similar examples, it may be tempting to stop be- 
fore completely solving the model—or to circumvent economic theory altogether— 
and specify a simultaneous equations system consisting of two equations. The first 
equation would describe work in terms of crime, while the second would have crime 
as a function of work (with other factors appearing in both equations). While it is 
often possible to write the first-order conditions for an optimization problem in this 
way, these equations are not the structural equations of interest. Neither equation can 
stand on its own, and neither has a causal interpretation. For example, what would it 
mean to study the effect of changing the market wage on hours spent in criminal 
activity, holding hours spent in legal employment fixed? An individual will generally 
adjust the time spent in both activities to a change in the market wage.

Often it is useful to determine how one endogenous choice variable trades off against
another, but in such cases the goal is not—and should not be—to infer causality. For
example, Biddle and Hamermesh (1990) present OLS regressions of minutes spent
per week sleeping on minutes per week working (controlling for education, age, and
other demographic and health factors). Biddle and Hamermesh recognize that there
is nothing “structural” about such an analysis. (In fact, the choice of the dependent
variable is largely arbitrary.) Biddle and Hamermesh (1990) do derive a structural 
model of the demand for sleep (along with a labor supply function) where a key ex- 
planatory variable is the wage offer. The demand for sleep has a causal interpreta- 
tion, and it does not include labor supply on the right-hand side. 

Why are SEM applications that do not satisfy the autonomy requirement so 
prevalent in applied work? One possibility is that there appears to be a general 
misperception that “structural” and “simultaneous” are synonymous. However, we 
already know that structural models need not be systems of simultaneous equations. 
And, as the crime/work example shows, a simultaneous system is not necessarily 
structural. 


9.2 Identification in a Linear System 


9.2.1 Exclusion Restrictions and Reduced Forms 


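Write a system of linear simultaneous equations for the population as

$$\begin{aligned}
y_1 &= y_{(1)}\gamma_{(1)} + z_{(1)}\delta_{(1)} + u_1 \\
&\;\;\vdots \\
y_G &= y_{(G)}\gamma_{(G)} + z_{(G)}\delta_{(G)} + u_G,
\end{aligned} \tag{9.1}$$

where $y_{(h)}$ is $1 \times G_h$, $\gamma_{(h)}$ is $G_h \times 1$, $z_{(h)}$ is $1 \times M_h$, and $\delta_{(h)}$ is $M_h \times 1$, $h = 1, 2, \ldots, G$.
These are structural equations for the endogenous variables $y_1, y_2, \ldots, y_G$. We will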
assume that, if the system (9.1) represents a true SEM, then equilibrium conditions 
have been imposed. Hopefully, each equation is autonomous, but, of course, they do 
not need to be for the statistical analysis. 

The vector $y_{(h)}$ denotes endogenous variables that appear on the right-hand side of
the $h$th structural equation. By convention, $y_{(h)}$ can contain any of the endogenous
variables $y_1, y_2, \ldots, y_G$ except for $y_h$. The variables in $z_{(h)}$ are the exogenous variables
appearing in equation $h$. Usually there is some overlap in the exogenous variables
across different equations; for example, except in special circumstances, each $z_{(h)}$
would contain unity to allow for nonzero intercepts. The restrictions imposed in
system (9.1) are called exclusion restrictions because certain endogenous and
exogenous variables are excluded from some equations.
The 1 x M vector of all exogenous variables z is assumed to satisfy 


$$E(z'u_g) = 0, \quad g = 1, 2, \ldots, G. \tag{9.2}$$


When all of the equations in system (9.1) are truly structural, we are usually willing to 
assume 


$$E(u_g \mid z) = 0, \quad g = 1, 2, \ldots, G. \tag{9.3}$$


However, we know from Chapters 5 and 8 that assumption (9.2) is sufficient for 
consistent estimation. Sometimes, especially in omitted variables and measurement 
error applications, one or more of the equations in system (9.1) will simply represent 
a linear projection onto exogenous variables, as in Example 8.2. It is for this reason 
that we use assumption (9.2) for most of our identification and estimation analysis. 
We assume throughout that E(z’z) is nonsingular, so that there are no exact linear 
dependencies among the exogenous variables in the population. 

Assumption (9.2) implies that the exogenous variables appearing anywhere in the 
system are orthogonal to all the structural errors. If some elements in, say, $z_{(1)}$ do not
appear in the second equation, then we are explicitly assuming that they do not enter
the structural equation for $y_2$. If there are no reasonable exclusion restrictions in an
SEM, it may be that the system fails the autonomy requirement. 

Generally, in the system (9.1), the error $u_g$ in equation $g$ will be correlated with $y_{(g)}$
(we show this correlation explicitly later), and so OLS and GLS will be inconsistent. 
Nevertheless, under certain identification assumptions, we can estimate this system 
using the instrumental variables procedures covered in Chapter 8. 

In addition to the exclusion restrictions in system (9.1), another possible source of 
identifying information is on the $G \times G$ variance matrix $\Sigma \equiv \mathrm{Var}(u)$. For now, $\Sigma$ is
unrestricted and therefore contains no identifying information.

To motivate the general analysis, consider specific labor supply and demand func- 
tions for some population: 


$$h^s(w) = \gamma_1\log(w) + z_{(1)}\delta_{(1)} + u_1$$

$$h^d(w) = \gamma_2\log(w) + z_{(2)}\delta_{(2)} + u_2,$$

where $w$ is the dummy argument in the labor supply and labor demand functions.
We assume that observed hours, $h$, and observed wage, $w$, equate supply and demand:

$$h = h^s(w) = h^d(w).$$

The variables in $z_{(1)}$ shift the labor supply curve, and $z_{(2)}$ contains labor demand
shifters. By defining $y_1 = h$ and $y_2 = \log(w)$ we can write the equations in
equilibrium as a linear simultaneous equations model:

$$y_1 = \gamma_1 y_2 + z_{(1)}\delta_{(1)} + u_1, \tag{9.4}$$

$$y_1 = \gamma_2 y_2 + z_{(2)}\delta_{(2)} + u_2. \tag{9.5}$$


Nothing about the general system (9.1) rules out having the same variable on the left- 
hand side of more than one equation. 

What is needed to identify the parameters in, say, the supply curve? Intuitively,
since we observe only the equilibrium quantities of hours and wages, we cannot
distinguish the supply function from the demand function if $z_{(1)}$ and $z_{(2)}$ contain exactly
the same elements. If, however, $z_{(2)}$ contains an element not in $z_{(1)}$, that is, if there is
some factor that exogenously shifts the demand curve but not the supply curve, then
we can hope to estimate the parameters of the supply curve. To identify the demand
curve, we need at least one element in $z_{(1)}$ that is not also in $z_{(2)}$.

To formally study identification, assume that $\gamma_1 \neq \gamma_2$; this assumption just means
that the supply and demand curves have different slopes. Subtracting equation (9.5)
from equation (9.4), dividing by $\gamma_2 - \gamma_1$, and rearranging gives

$$y_2 = z_{(1)}\pi_{21} + z_{(2)}\pi_{22} + v_2, \tag{9.6}$$

where $\pi_{21} \equiv \delta_{(1)}/(\gamma_2 - \gamma_1)$, $\pi_{22} \equiv -\delta_{(2)}/(\gamma_2 - \gamma_1)$, and $v_2 \equiv (u_1 - u_2)/(\gamma_2 - \gamma_1)$. This
is the reduced form for $y_2$ because it expresses $y_2$ as a linear function of all of the
exogenous variables and an error $v_2$ which, by assumption (9.2), is orthogonal to all
exogenous variables: $E(z'v_2) = 0$. Importantly, the reduced form for $y_2$ is obtained
from the two structural equations (9.4) and (9.5).
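
The algebra behind the reduced form (9.6) can be written out step by step. The lines below are only a worked restatement that uses the symbols of equations (9.4) and (9.5); nothing new is assumed beyond the two slopes being different.

$$\begin{aligned}
0 &= (\gamma_1 - \gamma_2)\,y_2 + z_{(1)}\delta_{(1)} - z_{(2)}\delta_{(2)} + u_1 - u_2 && \text{(9.4) minus (9.5)} \\
(\gamma_2 - \gamma_1)\,y_2 &= z_{(1)}\delta_{(1)} - z_{(2)}\delta_{(2)} + u_1 - u_2 && \text{move the } y_2 \text{ term to the left} \\
y_2 &= z_{(1)}\frac{\delta_{(1)}}{\gamma_2 - \gamma_1} - z_{(2)}\frac{\delta_{(2)}}{\gamma_2 - \gamma_1} + \frac{u_1 - u_2}{\gamma_2 - \gamma_1} && \text{divide by } \gamma_2 - \gamma_1
\end{aligned}$$

Matching terms gives the coefficients on $z_{(1)}$ and $z_{(2)}$ and the reduced form error stated in the text. Because the reduced form error is a linear combination of $u_1$ and $u_2$, it inherits orthogonality to $z$ from assumption (9.2).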

Given equation (9.4) and the reduced form (9.6), we can now use the identification
condition from Chapter 5 for a linear model with a single right-hand-side endogenous
variable. This condition is easy to state: the reduced form for $y_2$ must contain at least
one exogenous variable not also in equation (9.4). This means there must be at least
one element of $z_{(2)}$ not in $z_{(1)}$ with coefficient in equation (9.6) different from zero.
Now we use the structural equations. Because $\pi_{22}$ is proportional to $\delta_{(2)}$, the
condition is easily restated in terms of the structural parameters: in equation (9.5) at least
one element of $z_{(2)}$ not in $z_{(1)}$ must have nonzero coefficient. In the supply and
demand example, identification of the supply function requires at least one exogenous
variable appearing in the demand function that does not also appear in the supply
function; this conclusion corresponds exactly with our earlier intuition.

The condition for identifying equation (9.5) is just the mirror image: there must be
at least one element of $z_{(1)}$ actually appearing in equation (9.4) that is not also an
element of $z_{(2)}$.
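
The identification argument can be checked numerically. The following is a minimal simulation sketch, not part of the original analysis: it assumes scalar shifters, standard normal errors, and illustrative parameter values (a supply slope of 0.5, a demand slope of -1, unit coefficients on the shifters). The function names `simulate_market`, `ols`, `iv`, and `estimate_supply_slope` are hypothetical helpers introduced here. Data are generated from the reduced form (9.6), and the supply equation (9.4) is then estimated by OLS and by IV using the excluded demand shifter as the instrument for $y_2$.

```python
import numpy as np


def simulate_market(n=200_000, gamma1=0.5, gamma2=-1.0, seed=0):
    """Draw equilibrium data from equations (9.4)-(9.5) with scalar shifters.

    All numerical values are illustrative assumptions, not taken from the text.
    """
    rng = np.random.default_rng(seed)
    z1 = rng.normal(size=n)  # supply shifter, appears only in (9.4)
    z2 = rng.normal(size=n)  # demand shifter, appears only in (9.5)
    u1 = rng.normal(size=n)  # supply error
    u2 = rng.normal(size=n)  # demand error
    delta1, delta2 = 1.0, 1.0  # coefficients on z1 and z2
    # Reduced form (9.6) for y2 (log wage)
    y2 = (z1 * delta1 - z2 * delta2 + u1 - u2) / (gamma2 - gamma1)
    # Equilibrium hours from the supply equation (9.4)
    y1 = gamma1 * y2 + z1 * delta1 + u1
    return y1, y2, z1, z2


def ols(y, X):
    """OLS coefficients from regressing y on the columns of X."""
    return np.linalg.lstsq(X, y, rcond=None)[0]


def iv(y, X, Z):
    """Just-identified IV estimator (Z'X)^{-1} Z'y."""
    return np.linalg.solve(Z.T @ X, Z.T @ y)


def estimate_supply_slope(seed=0):
    """Return (OLS, IV) estimates of the supply slope gamma1."""
    y1, y2, z1, z2 = simulate_market(seed=seed)
    X = np.column_stack([y2, z1])  # regressors in the supply equation
    Z = np.column_stack([z2, z1])  # z2 instruments y2; z1 instruments itself
    return ols(y1, X)[0], iv(y1, X, Z)[0]


if __name__ == "__main__":
    g_ols, g_iv = estimate_supply_slope()
    print(f"OLS: {g_ols:.3f}  IV: {g_iv:.3f}  true gamma1: 0.5")
```

Under these assumed values the OLS slope is pulled well away from 0.5 because $y_2$ is correlated with $u_1$, while the IV estimate settles close to it. Setting the coefficient on the demand shifter to zero in the simulation would remove the only excluded exogenous variable and make the IV system singular in the limit, which is the failure of identification described above.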


Example 9.1 (Labor Supply for Married Women): Consider labor supply and de- 
mand equations for married women, with the equilibrium condition imposed: 


$$hours = \gamma_1\log(wage) + \delta_{10} + \delta_{11}educ + \delta_{12}age + \delta_{13}kids + \delta_{14}othinc + u_1$$

$$hours = \gamma_2\log(wage) + \delta_{20} + \delta_{21}educ + \delta_{22}exper + u_2.$$

The supply equation is identified because, by assumption, exper appears in the
demand function (assuming $\delta_{22} \neq 0$) but not in the supply equation. The assumption
that past experience has no direct effect on labor supply can be questioned, but it has 
been used by labor economists. The demand equation is identified provided that at 
least one of the three variables age, kids, and othinc actually appears in the supply 
equation. 


We now extend this analysis to the general system (9.1). For concreteness, we study 
identification of the first equation: 


$$y_1 = y_{(1)}\gamma_{(1)} + z_{(1)}\delta_{(1)} + u_1 = x_{(1)}\beta_{(1)} + u_1, \tag{9.7}$$

where the notation used for the subscripts is needed to distinguish an equation with
exclusion restrictions from the general equation that we study in Section 9.2.2.
Assuming that the reduced forms exist, write the reduced form for $y_{(1)}$ as

$$y_{(1)} = z\Pi_{(1)} + v_{(1)}, \tag{9.8}$$

where $E[z'v_{(1)}] = 0$. Further, define the $M \times M_1$ selection matrix $S_{(1)}$, which
consists of zeros and ones, such that $z_{(1)} = zS_{(1)}$. The rank condition from Chapter 5,
Assumption 2SLS.2b, can be stated as

$$\mathrm{rank}\, E[z'x_{(1)}] = K_1, \tag{9.9}$$

where $K_1 \equiv G_1 + M_1$. But $E[z'x_{(1)}] = E[z'(z\Pi_{(1)}, zS_{(1)})] = E(z'z)[\Pi_{(1)} \mid S_{(1)}]$. Since we
always assume that $E(z'z)$ has full rank $M$, assumption (9.9) is the same as

$$\mathrm{rank}[\Pi_{(1)} \mid S_{(1)}] = G_1 + M_1. \tag{9.10}$$

In other words, $[\Pi_{(1)} \mid S_{(1)}]$ must have full column rank. If the reduced form for $y_{(1)}$
has been found, this condition can be checked directly. But there is one thing we can
conclude immediately: because $[\Pi_{(1)} \mid S_{(1)}]$ is an $M \times (G_1 + M_1)$ matrix, a necessary
condition for assumption (9.10) is $M \geq G_1 + M_1$, or

$$M - M_1 \geq G_1. \tag{9.11}$$

We have already encountered condition (9.11) in Chapter 5: the number of
exogenous variables not appearing in the first equation, $M - M_1$, must be at least as great
as the number of endogenous variables appearing on the right-hand side of the first
equation (9.7), $G_1$. This is the order condition for identification of the first equation.
We have proven the following theorem: 


THEOREM 9.1 (Order Condition with Exclusion Restrictions): In a linear system of 
equations with exclusion restrictions, a necessary condition for identifying any par- 
ticular equation is that the number of excluded exogenous variables from the equa- 
tion must be at least as large as the number of included right-hand-side endogenous 
variables in the equation. 


It is important to remember that the order condition is only necessary, not suffi- 
cient, for identification. If the order condition fails for a particular equation, there is 
no hope of estimating the parameters in that equation. If the order condition is met, 
the equation might be identified. 
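
The gap between the order condition and the rank condition is easy to see in a small numerical case. The sketch below is illustrative only: the helper names `order_condition` and `rank_condition` and all matrix entries are assumptions made for this example. It takes $M = 3$ exogenous variables, of which $M_1 = 2$ appear in the first equation, and $G_1 = 1$ right-hand-side endogenous variable, so the order condition (9.11) holds with equality. The rank condition (9.10) then depends on whether the excluded exogenous variable has a nonzero reduced form coefficient.

```python
import numpy as np


def order_condition(M, M1, G1):
    """Necessary condition (9.11): excluded exogenous variables >= G1."""
    return (M - M1) >= G1


def rank_condition(Pi1, S1, tol=1e-10):
    """Condition (9.10): [Pi_(1) | S_(1)] has full column rank G1 + M1."""
    stacked = np.hstack([Pi1, S1])
    return np.linalg.matrix_rank(stacked, tol=tol) == stacked.shape[1]


# Selection matrix picking the first two of the three exogenous variables,
# so that z_(1) = z S_(1). Hypothetical layout for illustration.
S1 = np.array([[1.0, 0.0],
               [0.0, 1.0],
               [0.0, 0.0]])

# Reduced form coefficients for the single endogenous regressor (M x G1).
# In Pi1_ok the excluded third variable matters; in Pi1_bad it does not.
Pi1_ok = np.array([[0.2], [0.5], [0.7]])
Pi1_bad = np.array([[0.2], [0.5], [0.0]])

if __name__ == "__main__":
    print("order condition:", order_condition(M=3, M1=2, G1=1))
    print("rank condition, excluded variable relevant:", rank_condition(Pi1_ok, S1))
    print("rank condition, excluded variable irrelevant:", rank_condition(Pi1_bad, S1))
```

In the second case the order condition is satisfied but the stacked matrix has rank 2 rather than 3, so the equation is not identified. This is the sense in which the order condition is necessary but not sufficient.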


9.2.2 General Linear Restrictions and Structural Equations 


The identification analysis of the preceding subsection is useful when reduced forms
are appended to structural equations. When an entire structural system has been
specified, it is best to study identification entirely in terms of the structural parameters.
To this end, we now write the $G$ equations in the population as


$$\begin{aligned}
y\gamma_1 + z\delta_1 + u_1 &= 0 \\
&\;\;\vdots \\
y\gamma_G + z\delta_G + u_G &= 0,
\end{aligned} \tag{9.12}$$

where $y \equiv (y_1, y_2, \ldots, y_G)$ is the $1 \times G$ vector of all endogenous variables and $z \equiv
(z_1, \ldots, z_M)$ is still the $1 \times M$ vector of all exogenous variables, and probably
contains unity. We maintain assumption (9.2) throughout this section and also assume
that $E(z'z)$ is nonsingular. The notation here differs from that in Section 9.2.1. Here,
$\gamma_g$ is $G \times 1$ and $\delta_g$ is $M \times 1$ for all $g = 1, 2, \ldots, G$, so that the system (9.12) is the
general linear system without any restrictions on the structural parameters.

We can write this system compactly as 


$$y\Gamma + z\Delta + u = 0, \tag{9.13}$$

where $u \equiv (u_1, \ldots, u_G)$ is the $1 \times G$ vector of structural errors, $\Gamma$ is the $G \times G$ matrix
with $g$th column $\gamma_g$, and $\Delta$ is the $M \times G$ matrix with $g$th column $\delta_g$. So that a reduced
form exists, we assume that $\Gamma$ is nonsingular. Let $\Sigma \equiv E(u'u)$ denote the $G \times G$
variance matrix of $u$, which we assume to be nonsingular. At this point, we have
placed no other restrictions on $\Gamma$, $\Delta$, or $\Sigma$.

The reduced form is easily expressed as 


$$y = z(-\Delta\Gamma^{-1}) + u(-\Gamma^{-1}) \equiv z\Pi + v, \tag{9.14}$$

where $\Pi \equiv (-\Delta\Gamma^{-1})$ and $v \equiv u(-\Gamma^{-1})$. Define $\Lambda \equiv E(v'v) = \Gamma^{-1\prime}\Sigma\Gamma^{-1}$ as the
reduced form variance matrix. Because $E(z'v) = 0$ and $E(z'z)$ is nonsingular, $\Pi$ and $\Lambda$
are identified because they can be consistently estimated given a random sample on $y$
and $z$ by OLS equation by equation. The question is, under what assumptions can we
recover the structural parameters $\Gamma$, $\Delta$, and $\Sigma$ from the reduced form parameters?

It is easy to see that, without some restrictions, we will not be able to identify any 
of the parameters in the structural system. Let F be any G x G nonsingular matrix, 
and postmultiply equation (9.13) by F: 


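$$y\Gamma F + z\Delta F + uF = 0 \quad \text{or} \quad y\Gamma^* + z\Delta^* + u^* = 0, \tag{9.15}$$

where $\Gamma^* \equiv \Gamma F$, $\Delta^* \equiv \Delta F$, and $u^* \equiv uF$; note that $\mathrm{Var}(u^*) = F'\Sigma F$. Simple algebra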
shows that equations (9.15) and (9.13) have identical reduced forms. This result 
means that, without restrictions on the structural parameters, there are many equiv- 
alent structures in the sense that they lead to the same reduced form. In fact, there is 
an equivalent structure for each nonsingular F. 
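
The "simple algebra" can also be confirmed numerically. The sketch below is an illustration under assumed values: the matrices `GAMMA`, `DELTA`, `SIGMA`, and `F` are arbitrary nonsingular choices made for this example, and the helper names `reduced_form` and `transform_structure` are hypothetical. It computes the reduced form coefficient matrix and the reduced form variance matrix from a structure, postmultiplies the structure by $F$ as in equation (9.15), and recomputes them.

```python
import numpy as np


def reduced_form(Gamma, Delta, Sigma):
    """Reduced form parameters implied by y Gamma + z Delta + u = 0.

    Returns Pi = -Delta Gamma^{-1} and Lambda = Gamma^{-1}' Sigma Gamma^{-1}.
    """
    Gamma_inv = np.linalg.inv(Gamma)
    Pi = -Delta @ Gamma_inv
    Lam = Gamma_inv.T @ Sigma @ Gamma_inv
    return Pi, Lam


def transform_structure(Gamma, Delta, Sigma, F):
    """Postmultiply the structure by a nonsingular F, as in (9.15)."""
    return Gamma @ F, Delta @ F, F.T @ Sigma @ F


# Illustrative structure with G = 3 and M = 4 (values are assumptions).
GAMMA = np.array([[-1.0, 0.5, 0.0],
                  [0.3, -1.0, 0.2],
                  [0.0, 0.4, -1.0]])
DELTA = np.array([[1.0, 0.0, 0.5],
                  [0.2, 0.7, 0.0],
                  [0.0, 0.3, 0.9],
                  [0.6, 0.0, 0.1]])
SIGMA = np.array([[1.0, 0.2, 0.0],
                  [0.2, 1.0, 0.3],
                  [0.0, 0.3, 1.0]])
F = np.array([[2.0, 1.0, 0.0],
              [0.0, 1.0, 1.0],
              [1.0, 0.0, 3.0]])

if __name__ == "__main__":
    Pi, Lam = reduced_form(GAMMA, DELTA, SIGMA)
    Pi_star, Lam_star = reduced_form(*transform_structure(GAMMA, DELTA, SIGMA, F))
    print("same Pi:", np.allclose(Pi, Pi_star))
    print("same Lambda:", np.allclose(Lam, Lam_star))
```

The structural matrices change under the transformation, yet both reduced form objects are unchanged. Data on $y$ and $z$ alone therefore cannot distinguish the two structures, which is why restrictions on the structural parameters are needed.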


Let

$$B \equiv \begin{pmatrix} \Gamma \\ \Delta \end{pmatrix}$$

be the $(G + M) \times G$ matrix of structural parameters in equation
(9.13). If $F$ is any nonsingular $G \times G$ matrix, then $F$ represents an admissible linear
transformation if


1. $BF$ satisfies all restrictions on $B$.


2. $F'\Sigma F$ satisfies all restrictions on $\Sigma$.


To identify the system, we need enough prior information on the structural
parameters $(B, \Sigma)$ so that $F = I_G$ is the only admissible linear transformation.

In most applications identification of $B$ is of primary interest, and this
identification is achieved by putting restrictions directly on $B$. As we will touch on in Section
9.4.2, it is possible to put restrictions on $\Sigma$ in order to identify $B$, but this approach is
somewhat rare in practice. Until we come to Section 9.4.2, $\Sigma$ is an unrestricted $G \times G$
positive definite matrix.

As before, we consider identification of the first equation:

$$y\gamma_1 + z\delta_1 + u_1 = 0 \tag{9.16}$$

or $\gamma_{11}y_1 + \gamma_{12}y_2 + \cdots + \gamma_{1G}y_G + \delta_{11}z_1 + \delta_{12}z_2 + \cdots + \delta_{1M}z_M + u_1 = 0$. The first
restriction we make on the parameters in equation (9.16) is the normalization restriction
that one element of $\gamma_1$ is $-1$. Each equation in the system (9.1) has a normalization
restriction because one variable is taken to be the left-hand-side explained variable. 
In applications, there is usually a natural normalization for each equation. If there is 
not, we should ask whether the system satisfies the autonomy requirement discussed 
in Section 9.1. (Even in models that satisfy the autonomy requirement, we often have 
to choose between reasonable normalization conditions. For example, in Example 
9.1, we could have specified the second equation to be a wage offer equation rather 
than a labor demand equation.) 

Let $\beta_1 \equiv (\gamma_1', \delta_1')'$ be the $(G + M) \times 1$ vector of structural parameters in the first
equation. With a normalization restriction there are $(G + M) - 1$ unknown elements
in $\beta_1$. Assume that prior knowledge about $\beta_1$ can be expressed as

$$R_1\beta_1 = 0, \tag{9.17}$$

where $R_1$ is a $J_1 \times (G + M)$ matrix of known constants and $J_1$ is the number of
restrictions on $\beta_1$ (in addition to the normalization restriction). We assume that rank
$R_1 = J_1$, so that there are no redundant restrictions. The restrictions in assumption
(9.17) are sometimes called homogeneous linear restrictions, but, when coupled with a 
normalization assumption, equation (9.17) actually allows for nonhomogeneous 
restrictions. 


Example 9.2 (Three-Equation System): Consider the first equation in a system with 
G = 3 and M = 4: 
$$y_1 = \gamma_{12}y_2 + \gamma_{13}y_3 + \delta_{11}z_1 + \delta_{12}z_2 + \delta_{13}z_3 + \delta_{14}z_4 + u_1,$$

so that $\gamma_1 = (-1, \gamma_{12}, \gamma_{13})'$, $\delta_1 = (\delta_{11}, \delta_{12}, \delta_{13}, \delta_{14})'$, and $\beta_1 = (-1, \gamma_{12}, \gamma_{13}, \delta_{11}, \delta_{12}, \delta_{13},
\delta_{14})'$. (We can set $z_1 = 1$ to allow an intercept.) Suppose the restrictions on the
structural parameters are $\gamma_{12} = 0$ and $\delta_{13} + \delta_{14} = 3$. Then $J_1 = 2$ and

$$R_1 = \begin{pmatrix} 0 & 1 & 0 & 0 & 0 & 0 & 0 \\ 3 & 0 & 0 & 0 & 0 & 1 & 1 \end{pmatrix}.$$

Straightforward multiplication gives $R_1\beta_1 = (\gamma_{12}, \delta_{13} + \delta_{14} - 3)'$, and setting this
vector to zero as in equation (9.17) incorporates the restrictions on $\beta_1$.

Given the linear restrictions in equation (9.17), when are these and the
normalization restriction enough to identify $\beta_1$? Let $F$ again be any $G \times G$ nonsingular matrix,
and write it in terms of its columns as $F = (f_1, f_2, \ldots, f_G)$. Define a linear
transformation of $B$ as $B^* = BF$, so that the first column of $B^*$ is $\beta_1^* \equiv Bf_1$. We need to find a
condition so that equation (9.17) allows us to distinguish $\beta_1$ from any other $\beta_1^*$. For
the moment, ignore the normalization condition. The vector $\beta_1^*$ satisfies the linear
restrictions embodied by $R_1$ if and only if

$$R_1\beta_1^* = R_1(Bf_1) = (R_1B)f_1 = 0. \tag{9.18}$$

Naturally, $(R_1B)f_1 = 0$ is true for $f_1 = e_1 \equiv (1, 0, 0, \ldots, 0)'$, since then $\beta_1^* = Bf_1 =
\beta_1$. Since assumption (9.18) holds for $f_1 = e_1$ it clearly holds for any scalar multiple
of $e_1$. The key to identification is that vectors of the form $c_1e_1$, for some constant $c_1$,
are the only vectors $f_1$ satisfying condition (9.18). If condition (9.18) holds for vectors
$f_1$ other than scalar multiples of $e_1$, then we have no hope of identifying $\beta_1$.

Stating that condition (9.18) holds only for vectors of the form $c_1e_1$ just means that
the null space of $R_1B$ has dimension unity. Equivalently, because $R_1B$ has $G$ columns,

$$\mathrm{rank}\, R_1B = G - 1. \tag{9.19}$$

This is the rank condition for identification of $\beta_1$ in the first structural equation under
general linear restrictions. Once condition (9.19) is known to hold, the normalization
restriction allows us to distinguish $\beta_1$ from any other scalar multiple of $\beta_1$.


THEOREM 9.2 (Rank Condition for Identification): Let $\boldsymbol{\beta}_1$ be the $(G + M) \times 1$ vector
of structural parameters in the first equation, with the normalization restriction that
one of the coefficients on an endogenous variable is $-1$. Let the additional information
on $\boldsymbol{\beta}_1$ be given by restriction (9.17). Then $\boldsymbol{\beta}_1$ is identified if and only if the rank
condition (9.19) holds.


As promised earlier, the rank condition in this subsection depends on the structural
parameters, $\mathbf{B}$. We can determine whether the first equation is identified by studying
the matrix $\mathbf{R}_1 \mathbf{B}$. Since this matrix can depend on all structural parameters, we must
generally specify the entire structural model.

The $J_1 \times G$ matrix $\mathbf{R}_1 \mathbf{B}$ can be written as $\mathbf{R}_1 \mathbf{B} = [\mathbf{R}_1 \boldsymbol{\beta}_1, \mathbf{R}_1 \boldsymbol{\beta}_2, \ldots, \mathbf{R}_1 \boldsymbol{\beta}_G]$, where $\boldsymbol{\beta}_g$
is the $(G + M) \times 1$ vector of structural parameters in equation $g$. By assumption
(9.17), the first column of $\mathbf{R}_1 \mathbf{B}$ is the zero vector. Therefore, $\mathbf{R}_1 \mathbf{B}$ cannot have rank
larger than $G - 1$. What we must check is whether the columns of $\mathbf{R}_1 \mathbf{B}$ other than the
first form a linearly independent set.

Using condition (9.19), we can get a more general form of the order condition.
Because $\boldsymbol{\Gamma}$ is nonsingular, $\mathbf{B}$ necessarily has rank $G$ (full column rank). Therefore, for
condition (9.19) to hold, we must have rank $\mathbf{R}_1 \ge G - 1$. But we have assumed that
rank $\mathbf{R}_1 = J_1$, which is the row dimension of $\mathbf{R}_1$.


THEOREM 9.3 (Order Condition for Identification): In system (9.12) under assumption
(9.17), a necessary condition for the first equation to be identified is


$$J_1 \ge G - 1, \tag{9.20}$$


where $J_1$ is the row dimension of $\mathbf{R}_1$. Equation (9.20) is the general form of the order
condition.


We can summarize the steps for checking whether the first equation in the system is 
identified. 


1. Set one element of $\boldsymbol{\gamma}_1$ to $-1$ as a normalization.

2. Define the $J_1 \times (G + M)$ matrix $\mathbf{R}_1$ such that equation (9.17) captures all
restrictions on $\boldsymbol{\beta}_1$.

3. If $J_1 < G - 1$, the first equation is not identified.

4. If $J_1 \ge G - 1$, the equation might be identified. Let $\mathbf{B}$ be the matrix of all
structural parameters with only the normalization restrictions imposed, and compute
$\mathbf{R}_1 \mathbf{B}$. Now impose the restrictions in the entire system and check the rank condition
(9.19).
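
The four steps above can be sketched as a small routine. This is only an illustration under stated assumptions: the function name `check_identification` and the two-equation demonstration system are hypothetical, and the rank is evaluated at particular numerical parameter values, so a result holds only for values that are "generic" in the sense that no unrestricted parameter happens to equal zero.

```python
import numpy as np

def check_identification(R1, B, tol=1e-10):
    """Apply the order condition (9.20) and rank condition (9.19) to equation 1.

    R1 : (J1, G+M) restriction matrix with R1 @ B[:, 0] = 0.
    B  : (G+M, G) structural parameter matrix with all restrictions imposed,
         evaluated at specific (hypothetical) parameter values.
    """
    R1 = np.atleast_2d(np.asarray(R1, dtype=float))
    B = np.asarray(B, dtype=float)
    J1, G = R1.shape[0], B.shape[1]

    # Step 3: J1 >= G - 1 is necessary for identification.
    if J1 < G - 1:
        return "not identified (order condition fails)"

    # Step 4: rank(R1 B) = G - 1 is necessary and sufficient.
    if np.linalg.matrix_rank(R1 @ B, tol=tol) < G - 1:
        return "not identified (rank condition fails)"
    return "just identified" if J1 == G - 1 else "identified (order condition is slack)"

# Hypothetical system with G = 2 and M = 2; rows of B are (y1, y2, z1, z2).
#   y1 = 0.5*y2 + 1.0*z1 + u1                 (z2 excluded from equation 1)
#   y2 = -0.4*y1 + 0.5*z1 + d22*z2 + u2
def make_B(d22):
    return np.array([[-1.0, -0.4],
                     [0.5, -1.0],
                     [1.0, 0.5],
                     [0.0, d22]])

R1 = np.array([[0.0, 0.0, 0.0, 1.0]])          # imposes delta12 = 0

print(check_identification(R1, make_B(d22=1.0)))
print(check_identification(R1, make_B(d22=0.0)))
```

In the second call $z_2$ does not appear anywhere in the system, so excluding it from the first equation provides no identifying information even though the order condition holds.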


The simplicity of the order condition makes it attractive as a tool for studying 
identification. Nevertheless, it is not difficult to write down examples where the order 
condition is satisfied but the rank condition fails. 


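Example 9.3 (Failure of the Rank Condition): Consider the following three-equation
structural model in the population ($G = 3$, $M = 4$):


$$y_1 = \gamma_{12} y_2 + \gamma_{13} y_3 + \delta_{11} z_1 + \delta_{13} z_3 + u_1, \tag{9.21}$$
$$y_2 = \gamma_{21} y_1 + \delta_{21} z_1 + u_2, \tag{9.22}$$
$$y_3 = \delta_{31} z_1 + \delta_{32} z_2 + \delta_{33} z_3 + \delta_{34} z_4 + u_3, \tag{9.23}$$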


where $z_1 \equiv 1$, $E(u_g) = 0$, $g = 1, 2, 3$, and each $z_j$ is uncorrelated with each $u_g$. Note
that the third equation is already a reduced form equation (although it may also have
a structural interpretation). In equation (9.21) we have set $\gamma_{11} = -1$, $\delta_{12} = 0$, and
$\delta_{14} = 0$. Since this equation contains two right-hand-side endogenous variables and
there are two excluded exogenous variables, it passes the order condition.

To check the rank condition, let $\boldsymbol{\beta}_1$ denote the $7 \times 1$ vector of parameters in the
first equation with only the normalization restriction imposed:
$\boldsymbol{\beta}_1 = (-1, \gamma_{12}, \gamma_{13}, \delta_{11}, \delta_{12}, \delta_{13}, \delta_{14})'$. The restrictions $\delta_{12} = 0$ and $\delta_{14} = 0$ are
obtained by choosing


$$\mathbf{R}_1 = \begin{pmatrix} 0 & 0 & 0 & 0 & 1 & 0 & 0 \\ 0 & 0 & 0 & 0 & 0 & 0 & 1 \end{pmatrix}.$$


Let $\mathbf{B}$ be the full $7 \times 3$ matrix of parameters with only the three normalizations
imposed (so that $\boldsymbol{\beta}_2 = (\gamma_{21}, -1, \gamma_{23}, \delta_{21}, \delta_{22}, \delta_{23}, \delta_{24})'$ and
$\boldsymbol{\beta}_3 = (\gamma_{31}, \gamma_{32}, -1, \delta_{31}, \delta_{32}, \delta_{33}, \delta_{34})'$). Matrix multiplication gives


$$\mathbf{R}_1 \mathbf{B} = \begin{pmatrix} \delta_{12} & \delta_{22} & \delta_{32} \\ \delta_{14} & \delta_{24} & \delta_{34} \end{pmatrix}.$$


Now we impose all of the restrictions in the system. In addition to the restrictions
$\delta_{12} = 0$ and $\delta_{14} = 0$ from equation (9.21), we also have $\delta_{22} = 0$ and $\delta_{24} = 0$ from
equation (9.22). Therefore, with all restrictions imposed,


$$\mathbf{R}_1 \mathbf{B} = \begin{pmatrix} 0 & 0 & \delta_{32} \\ 0 & 0 & \delta_{34} \end{pmatrix}. \tag{9.24}$$


The rank of this matrix is at most unity, and so the rank condition fails because
$G - 1 = 2$.

Equation (9.22) easily passes the order condition. What about the rank condition?
If $\mathbf{R}_2$ is the $4 \times 7$ matrix imposing the restrictions on $\boldsymbol{\beta}_2$, namely, $\gamma_{23} = 0$, $\delta_{22} = 0$,
$\delta_{23} = 0$, and $\delta_{24} = 0$, then it is easily seen that, with the restrictions on the entire
system imposed,


$$\mathbf{R}_2 \mathbf{B} = \begin{pmatrix} \gamma_{13} & 0 & -1 \\ 0 & 0 & \delta_{32} \\ \delta_{13} & 0 & \delta_{33} \\ 0 & 0 & \delta_{34} \end{pmatrix},$$


and we need this matrix to have rank equal to two if (9.22) is identified. A sufficient
condition is $\delta_{13} \neq 0$ and at least one of $\delta_{32}$ and $\delta_{34}$ is different from zero. The rank
condition fails if $\gamma_{13} = \delta_{13} = 0$, in which case $y_1$ and $y_2$ form a two-equation system
with only one exogenous variable, $z_1$, appearing in both equations. The third
equation, (9.23), is identified because it contains no endogenous explanatory variables.
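
The two rank calculations in Example 9.3 can be reproduced numerically. In this sketch every numerical value is hypothetical; any nonzero values for the unrestricted parameters would give the same ranks.

```python
import numpy as np

# Hypothetical nonzero values for the unrestricted parameters of (9.21)-(9.23).
g12, g13, g21 = 0.5, -0.3, 0.8
d11, d13, d21 = 1.0, 0.6, -1.0
d31, d32, d33, d34 = 0.2, 0.4, -0.7, 0.9

# Columns are beta_1, beta_2, beta_3 with all restrictions imposed;
# rows correspond to (y1, y2, y3, z1, z2, z3, z4).
B = np.array([
    [-1.0, g21, 0.0],
    [g12, -1.0, 0.0],
    [g13, 0.0, -1.0],
    [d11, d21, d31],
    [0.0, 0.0, d32],
    [d13, 0.0, d33],
    [0.0, 0.0, d34],
])

# R1 imposes delta12 = 0 and delta14 = 0 on the first equation.
R1 = np.zeros((2, 7))
R1[0, 4] = R1[1, 6] = 1.0

# R2 imposes gamma23 = delta22 = delta23 = delta24 = 0 on the second equation.
R2 = np.zeros((4, 7))
R2[0, 2] = R2[1, 4] = R2[2, 5] = R2[3, 6] = 1.0

rank_R1B = np.linalg.matrix_rank(R1 @ B)   # needs to equal G - 1 = 2
rank_R2B = np.linalg.matrix_rank(R2 @ B)   # needs to equal G - 1 = 2
print(rank_R1B, rank_R2B)
```

The first rank falls short of $G - 1 = 2$ whatever values are chosen, because two columns of $\mathbf{R}_1 \mathbf{B}$ are identically zero, while the second reaches 2 under the sufficient condition stated above.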


When the restrictions on $\boldsymbol{\beta}_1$ consist entirely of normalization and exclusion
restrictions, the order condition (9.20) reduces to the order condition (9.11), as can be
seen by the following argument. When all restrictions are exclusion restrictions, the
matrix $\mathbf{R}_1$ consists only of zeros and ones, and the number of rows in $\mathbf{R}_1$ equals
the number of excluded right-hand-side endogenous variables, $G - G_1 - 1$, plus the
number of excluded exogenous variables, $M - M_1$. In other words,
$J_1 = (G - G_1 - 1) + (M - M_1)$, and so the order condition (9.20) becomes
$(G - G_1 - 1) + (M - M_1) \ge G - 1$, which, upon rearrangement, becomes condition (9.11).
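
The rearrangement takes one line. The display below assumes that condition (9.11) is stated in its usual form for exclusion restrictions, namely that the number of excluded exogenous variables is at least the number of included right-hand-side endogenous variables:

```latex
\begin{aligned}
(G - G_1 - 1) + (M - M_1) &\ge G - 1 \\
\iff \quad M - M_1 &\ge G_1 .
\end{aligned}
```

The term $G - 1$ cancels from both sides, which is why the total number of equations in the system does not matter for the order condition under pure exclusion restrictions.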


9.2.3 Unidentified, Just Identified, and Overidentified Equations 


We have seen that, for identifying a single equation, the rank condition (9.19) is 
necessary and sufficient. When condition (9.19) fails, we say that the equation is 
unidentified. 

When the rank condition holds, it is useful to refine the sense in which the equation
is identified. If $J_1 = G - 1$, then we have just enough identifying information. If we
were to drop one restriction in $\mathbf{R}_1$, we would necessarily lose identification of the first
equation because the order condition would fail. Therefore, when $J_1 = G - 1$, we say
that the equation is just identified.

If $J_1 > G - 1$, it is often possible to drop one or more restrictions on the parameters
of the first equation and still achieve identification. In this case we say the
equation is overidentified. Necessary but not sufficient for overidentification is
$J_1 > G - 1$. It is possible that $J_1$ is strictly greater than $G - 1$ but the restrictions are
such that dropping one restriction loses identification, in which case the equation is not
overidentified.

In practice, we often appeal to the order condition to determine the degree of
overidentification. While in special circumstances this approach can fail to be accurate,
for most applications it is reasonable. Thus, for the first equation, $J_1 - (G - 1)$
is usually interpreted as the number of overidentifying restrictions.


Example 9.4 (Overidentifying Restrictions): Consider the two-equation system


$$y_1 = \gamma_{12} y_2 + \delta_{11} z_1 + \delta_{12} z_2 + \delta_{13} z_3 + \delta_{14} z_4 + u_1, \tag{9.25}$$
$$y_2 = \gamma_{21} y_1 + \delta_{21} z_1 + \delta_{22} z_2 + u_2, \tag{9.26}$$


where $E(z_j u_g) = 0$, all $j$ and $g$. Without further restrictions, equation (9.25) fails the
order condition because every exogenous variable appears on the right-hand side,
and the equation contains an endogenous variable. Using the order condition,
equation (9.26) is overidentified, with one overidentifying restriction. If $z_3$ does not
actually appear in equation (9.25), then equation (9.26) is just identified, assuming that
$\delta_{14} \neq 0$.


9.3 Estimation after Identification


9.3.1 Robustness-Efficiency Trade-off 


All SEMs with linearly homogeneous restrictions within each equation can be written
with exclusion restrictions as in the system (9.1); doing so may require redefining
some of the variables. If we let $\mathbf{x}_{(g)} = (\mathbf{y}_{(g)}, \mathbf{z}_{(g)})$ and $\boldsymbol{\beta}_{(g)} = (\boldsymbol{\gamma}_{(g)}', \boldsymbol{\delta}_{(g)}')'$, then the
system (9.1) is in the general form (8.11) with the slight change in notation. Under
assumption (9.2) the matrix of instruments for observation $i$ is the $G \times GM$ matrix


$$\mathbf{Z}_i \equiv \mathbf{I}_G \otimes \mathbf{z}_i. \tag{9.27}$$
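
As a quick illustration of the Kronecker product in (9.27), the sketch below builds $\mathbf{Z}_i$ for a single observation. The dimensions and the values in `z_i` are hypothetical.

```python
import numpy as np

G = 2                                     # number of equations (hypothetical)
z_i = np.array([[1.0, 0.3, -1.2]])        # 1 x M row of exogenous variables, M = 3

# I_G (x) z_i places z_i on the block diagonal: the same M instruments are
# available for every equation.
Z_i = np.kron(np.eye(G), z_i)
print(Z_i.shape)
print(Z_i)
```

Each row of `Z_i` corresponds to one equation and contains `z_i` in that equation's block of columns and zeros elsewhere.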


If every equation in the system passes the rank condition, a system estimation 
procedure—such as 3SLS or the more general minimum chi-square estimator—can 
be used. Alternatively, the equations of interest can be estimated by 2SLS. The bot- 
tom line is that the methods studied in Chapters 5 and 8 are directly applicable. All of 
the tests we have covered apply, including the tests of overidentifying restrictions in 
Chapters 6 and 8 and the single-equation tests for endogeneity in Chapter 6. 

When estimating a simultaneous equations system, it is important to remember the 
pros and cons of full system estimation. If all equations are correctly specified, system 
procedures are asymptotically more efficient than a single-equation procedure such as 
2SLS. But single-equation methods are more robust. If interest lies, say, in the first 
equation of a system, 2SLS is consistent and asymptotically normal, provided the 
first equation is correctly specified and the instruments are exogenous. However, if 
one equation in a system is misspecified, the 3SLS or GMM estimates of all the pa- 
rameters are generally inconsistent. 
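
The single-equation approach can be sketched with simulated data. Everything below is an assumption made for illustration: the two-equation system, its parameter values, and the helper `two_sls` are hypothetical and are not the estimators' implementation in any particular package.

```python
import numpy as np

def two_sls(y, X, Z):
    """2SLS: project X on the instruments Z, then regress y on the fitted values."""
    X_hat = Z @ np.linalg.lstsq(Z, X, rcond=None)[0]
    return np.linalg.lstsq(X_hat, y, rcond=None)[0]

rng = np.random.default_rng(0)
N = 50_000
z = np.column_stack([np.ones(N), rng.normal(size=(N, 2))])     # z1 = 1, z2, z3
u = rng.multivariate_normal([0, 0], [[1.0, -0.5], [-0.5, 1.0]], size=N)

# Hypothetical structural system:
#   y1 = g12*y2 + d11*z1 + d12*z2 + u1      (z3 excluded)
#   y2 = g21*y1 + d21*z1 + d23*z3 + u2      (z2 excluded)
g12, g21 = 0.5, -0.4
d11, d12, d21, d23 = 1.0, 1.0, 0.5, 1.0

# Solve the two equations jointly to generate the equilibrium outcomes.
a1 = d11 + d12 * z[:, 1] + u[:, 0]
a2 = d21 + d23 * z[:, 2] + u[:, 1]
y1 = (a1 + g12 * a2) / (1 - g12 * g21)
y2 = g21 * y1 + a2

X1 = np.column_stack([y2, z[:, 0], z[:, 1]])   # regressors in the first equation
b_ols = np.linalg.lstsq(X1, y1, rcond=None)[0]
b_2sls = two_sls(y1, X1, z)                    # all exogenous variables as instruments
print("OLS :", b_ols)
print("2SLS:", b_2sls)
```

Only the first equation needs to be specified correctly for the 2SLS estimates of its parameters to be consistent; the second equation enters only through the availability of $z_3$ as an instrument.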


Example 9.5 (Labor Supply for Married, Working Women): Using the data in 
MROZ.RAW, we estimate a labor supply function for working, married women. 
Rather than specify a demand function, we specify the second equation as a wage 
offer function and impose the equilibrium condition: 


$$\mathit{hours} = \gamma_{12} \log(\mathit{wage}) + \delta_{10} + \delta_{11}\mathit{educ} + \delta_{12}\mathit{age} + \delta_{13}\mathit{kidslt6} + \delta_{14}\mathit{kidsge6} + \delta_{15}\mathit{nwifeinc} + u_1, \tag{9.28}$$
$$\log(\mathit{wage}) = \gamma_{21}\mathit{hours} + \delta_{20} + \delta_{21}\mathit{educ} + \delta_{22}\mathit{exper} + \delta_{23}\mathit{exper}^2 + u_2, \tag{9.29}$$


where kidslt6 is number of children less than 6, kidsge6 is number of children between
6 and 18, and nwifeinc is income other than the woman’s labor income. We assume
that $u_1$ and $u_2$ have zero mean conditional on educ, age, kidslt6, kidsge6, nwifeinc,
and exper.

The key restriction on the labor supply function is that exper (and exper$^2$) have no
direct effect on current annual hours. This identifies the labor supply function with
one overidentifying restriction, as used by Mroz (1987). We estimate the labor supply
function first by OLS (to see what ignoring the endogeneity of log(wage) does) and
then by 2SLS, using as instruments all exogenous variables in equations (9.28) and
(9.29).

There are 428 women who worked at some time during the survey year, 1975. The 
average annual hours are about 1,303, with a minimum of 12 and a maximum of 
4,950. 

We first estimate the labor supply function by OLS: 


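$$\begin{aligned}
\widehat{\mathit{hours}} = {} & \underset{(340.1)}{2{,}114.7} - \underset{(54.22)}{17.41}\,\log(\mathit{wage}) - \underset{(17.97)}{14.44}\,\mathit{educ} - \underset{(5.53)}{7.73}\,\mathit{age} \\
& - \underset{(100.01)}{342.50}\,\mathit{kidslt6} - \underset{(30.83)}{115.02}\,\mathit{kidsge6} - \underset{(3.66)}{4.35}\,\mathit{nwifeinc}.
\end{aligned}$$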


The OLS estimates indicate a downward-sloping labor supply function, although the 
estimate on log(wage) is statistically insignificant. 
The estimates are much different when we use 2SLS: 


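$$\begin{aligned}
\widehat{\mathit{hours}} = {} & \underset{(594.2)}{2{,}432.2} + \underset{(480.74)}{1{,}544.82}\,\log(\mathit{wage}) - \underset{(58.14)}{177.45}\,\mathit{educ} - \underset{(9.58)}{10.78}\,\mathit{age} \\
& - \underset{(176.93)}{210.83}\,\mathit{kidslt6} - \underset{(56.92)}{47.56}\,\mathit{kidsge6} - \underset{(6.48)}{9.25}\,\mathit{nwifeinc}.
\end{aligned}$$


The estimated labor supply elasticity is 1,544.82/hours. At the mean hours for working
women, 1,303, the estimated elasticity is about 1.2, which is quite large.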

The supply equation has a single overidentifying restriction. The regression of the
2SLS residuals $\hat{u}_1$ on all exogenous variables produces $R_u^2 = .002$, and so the test
statistic is $428(.002) \approx .856$ with p-value $\approx .355$; the overidentifying restriction is not
rejected.
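
The arithmetic behind the test statistic and its p-value can be reproduced directly from the numbers reported above. This sketch uses only the standard library and relies on the fact that, for one degree of freedom, the chi-square tail probability is $P(\chi^2_1 > x) = \operatorname{erfc}(\sqrt{x/2})$; the variable names are ours.

```python
import math

N = 428          # number of working women, from the text
R2_u = 0.002     # R-squared from regressing the 2SLS residuals on all exogenous variables

# N * R-squared is asymptotically chi-square with degrees of freedom equal to
# the number of overidentifying restrictions (one here).
stat = N * R2_u

# Tail probability of a chi-square(1) random variable.
p_value = math.erfc(math.sqrt(stat / 2))
print(round(stat, 3), round(p_value, 3))
```

With more than one overidentifying restriction the closed form above no longer applies, and a general chi-square distribution function would be needed instead.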

Under the exclusion restrictions we have imposed, the wage offer function (9.29) is
also identified. Before estimating the equation by 2SLS, we first estimate the reduced
form for hours to ensure that the exogenous variables excluded from equation (9.29)
are jointly significant. The p-value for the F test of joint significance of age, kidslt6,
kidsge6, and nwifeinc is about .0009. Therefore, we can proceed with 2SLS estimation
of the wage offer equation. The coefficient on hours is about .00016 (standard
error $\approx$ .00022), and so the wage offer does not appear to differ by hours worked. The
remaining coefficients are similar to what is obtained by dropping hours from
equation (9.29) and estimating the equation by OLS. (For example, the 2SLS coefficient
on education is about .111 with se $\approx$ .015.)

Interestingly, while the wage offer function (9.29) is identified, the analogous labor 
demand function is apparently unidentified. (This finding shows that choosing the 
normalization—that is, choosing between a labor demand function and a wage offer 
function—is not innocuous.) The labor demand function, written in equilibrium, 
would look like this: 


$$\mathit{hours} = \gamma_{22} \log(\mathit{wage}) + \delta_{20} + \delta_{21}\mathit{educ} + \delta_{22}\mathit{exper} + \delta_{23}\mathit{exper}^2 + u_2. \tag{9.30}$$


Estimating the reduced form for log(wage) and testing for joint significance of age, 
kidslt6, kidsge6, and nwifeinc yields a p-value of about .46, and so the exogenous 
variables excluded from equation (9.30) would not seem to appear in the reduced 
form for log(wage). Estimation of equation (9.30) by 2SLS would be pointless. (You 
are invited to estimate equation (9.30) by 2SLS to see what happens.) 

It would be more efficient to estimate equations (9.28) and (9.29) by 3SLS, since
each equation is overidentified (under the homoskedasticity assumption SIV.5). If
heteroskedasticity is suspected, we could use the general minimum chi-square
estimator. A system procedure is more efficient for estimating the labor supply function
because it uses the information that age, kidslt6, kidsge6, and nwifeinc do not appear
in the log(wage) equation. If these exclusion restrictions are wrong, the 3SLS
estimators of parameters in both equations are generally inconsistent. Problem 9.9 asks
you to obtain the 3SLS estimates for this example.


9.3.2 When Are 2SLS and 3SLS Equivalent? 


In Section 8.4 we discussed the relationship between system 2SLS and (GMM) 3SLS 
for a general linear system. Applying that discussion to linear SEMs, we can imme- 
diately draw the following conclusions. First, if each equation is just identified, 2SLS 
equation by equation and 3SLS are identical; in fact, each is identical to the system 
IV estimator. Basically, there is only one consistent estimator, the IV estimator on 
each equation. Second, regardless of the degree of overidentification, 2SLS equation 
by equation and 3SLS are identical when $\hat{\boldsymbol{\Sigma}}$ is diagonal. (As a practical matter, this
occurs only if we force $\hat{\boldsymbol{\Sigma}}$ to be diagonal.)
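
The first equivalence can be checked numerically. The sketch below is an illustration under stated assumptions: the two-equation system and its parameter values are hypothetical, every equation uses the same instruments, and 3SLS is computed from the traditional formula $[\mathbf{X}'(\hat{\boldsymbol{\Sigma}}^{-1} \otimes \mathbf{P}_Z)\mathbf{X}]^{-1}\mathbf{X}'(\hat{\boldsymbol{\Sigma}}^{-1} \otimes \mathbf{P}_Z)\mathbf{y}$ with the observations stacked equation by equation.

```python
import numpy as np

rng = np.random.default_rng(1)
N = 400
Z = np.column_stack([np.ones(N), rng.normal(size=(N, 2))])     # z1 = 1, z2, z3
u = rng.multivariate_normal([0, 0], [[1.0, 0.6], [0.6, 1.0]], size=N)

# Hypothetical just identified system: each equation excludes one exogenous variable.
g12, g21 = 0.5, -0.4
a1 = 1.0 + 1.0 * Z[:, 1] + u[:, 0]          # equation 1 excludes z3
a2 = 0.5 + 1.0 * Z[:, 2] + u[:, 1]          # equation 2 excludes z2
y1 = (a1 + g12 * a2) / (1 - g12 * g21)
y2 = g21 * y1 + a2

X1 = np.column_stack([y2, Z[:, 0], Z[:, 1]])
X2 = np.column_stack([y1, Z[:, 0], Z[:, 2]])
P = Z @ np.linalg.solve(Z.T @ Z, Z.T)       # projection matrix onto the instruments

def two_sls(y, X):
    return np.linalg.solve(X.T @ P @ X, X.T @ P @ y)

b1, b2 = two_sls(y1, X1), two_sls(y2, X2)
b_2sls = np.concatenate([b1, b2])

# Estimate the error covariance matrix from the 2SLS residuals.
resid = np.column_stack([y1 - X1 @ b1, y2 - X2 @ b2])
Sigma_hat = resid.T @ resid / N

# Stack the system equation by equation and apply the 3SLS weighting.
X = np.block([[X1, np.zeros_like(X2)], [np.zeros_like(X1), X2]])
y = np.concatenate([y1, y2])
W = np.kron(np.linalg.inv(Sigma_hat), P)
b_3sls = np.linalg.solve(X.T @ W @ X, X.T @ W @ y)

print(np.max(np.abs(b_3sls - b_2sls)))
```

The two estimators agree up to rounding error even though `Sigma_hat` is far from diagonal, because with exactly as many instruments as parameters in each equation the weighting matrix drops out of the estimator.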

There are several other useful algebraic equivalences that have been derived elsewhere.
Suppose that the first equation in the system is overidentified but every other
equation in the system is just identified. Then the 2SLS estimator of the first equation
is identical to the 3SLS estimator of that equation obtained from the entire system. (A
special case occurs when the first equation is a structural equation and all other
equations are unrestricted reduced forms.) As an extension of this result, suppose that
we sort the equations of an identified system into two groups, just identified and
overidentified. Then for the overidentified set of equations, the 3SLS estimates based
on the entire system are identical to the 3SLS estimates obtained using only the
overidentified subset. Of course, 3SLS estimation on the just identified set of equations
is equivalent to 2SLS estimation on each equation, and this generally differs from the
3SLS estimates based on the entire system. See Schmidt (1976, Theorem 5.2.13) for
verification.


9.3.3 Estimating the Reduced Form Parameters 


So far, we have discussed estimation of the structural parameters. The usual justifi- 
cations for focusing on the structural parameters are as follows: (1) we are interested 
in estimates of “economic parameters” (such as labor supply elasticities) for curi- 
osity’s sake; (2) estimates of structural parameters allow us to obtain the effects of a 
variety of policy interventions (such as changes in tax rates); and (3) even if we want 
to estimate the reduced form parameters, we often can do so more efficiently by first 
estimating the structural parameters. Concerning the second reason, if the goal is to 
estimate, say, the equilibrium change in hours worked given an exogenous change in 
a marginal tax rate, we must ultimately estimate the reduced form. 

As another example, we might want to estimate the effect on county-level alcohol 
consumption due to an increase in exogenous alcohol taxes. In other words, we are 
interested in $\partial E(y_g \mid \mathbf{z})/\partial z_j = \pi_{gj}$, where $y_g$ is alcohol consumption and $z_j$ is the tax
on alcohol. Under weak assumptions, reduced form equations exist, and each equa- 
tion of the reduced form can be estimated by ordinary least squares. Without placing 
any restrictions on the reduced form, OLS equation by equation is identical to SUR 
estimation (see Section 7.7). In other words, we do not need to analyze the structural 
equations at all in order to consistently estimate the reduced form parameters. Ordi- 
nary least squares estimates of the reduced form parameters are robust in the sense 
that they do not rely on any identification assumptions imposed on the structural 
system. 

If the structural model is correctly specified and at least one equation is
overidentified, we obtain asymptotically more efficient estimators of the reduced form
parameters by deriving the estimates from the structural parameter estimates. In
particular, given the structural parameter estimates $\hat{\boldsymbol{\Delta}}$ and $\hat{\boldsymbol{\Gamma}}$, we can obtain the
reduced form estimates as $\hat{\boldsymbol{\Pi}} = -\hat{\boldsymbol{\Delta}}\hat{\boldsymbol{\Gamma}}^{-1}$ (see equation (9.14)). These are consistent,
$\sqrt{N}$-asymptotically normal estimators (although the asymptotic variance matrix is
somewhat complicated). From Problem 3.9, we obtain the most efficient estimator of $\boldsymbol{\Pi}$ by
using the most efficient estimators of $\boldsymbol{\Delta}$ and $\boldsymbol{\Gamma}$ (minimum chi-square or, under system
homoskedasticity, 3SLS).
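
The mapping from structural to reduced form estimates is a single matrix operation. The sketch below assumes the system is written as $\mathbf{y}\boldsymbol{\Gamma} + \mathbf{z}\boldsymbol{\Delta} + \mathbf{u} = \mathbf{0}$, which is the convention under which $\boldsymbol{\Pi} = -\boldsymbol{\Delta}\boldsymbol{\Gamma}^{-1}$ follows; the numerical values of `Gamma_hat` and `Delta_hat` are hypothetical.

```python
import numpy as np

# Hypothetical structural estimates for G = 2 equations and M = 3 exogenous
# variables. Column g of Gamma_hat and Delta_hat holds the coefficients of
# equation g, with the normalization gamma_gg = -1.
Gamma_hat = np.array([[-1.0, -0.4],
                      [0.5, -1.0]])
Delta_hat = np.array([[1.0, 0.5],
                      [1.0, 0.0],       # z2 is excluded from equation 2
                      [0.0, 1.0]])      # z3 is excluded from equation 1

# Reduced form coefficients implied by the structural estimates (M x G).
Pi_hat = -Delta_hat @ np.linalg.inv(Gamma_hat)

# Check: with u = 0, y = z @ Pi_hat must satisfy the structural equations.
z = np.array([[1.0, 2.0, -1.0]])
y = z @ Pi_hat
structural_residual = y @ Gamma_hat + z @ Delta_hat
print(Pi_hat)
print(structural_residual)
```

Although `Delta_hat` contains exact zeros, `Pi_hat` generally has no zero entries: the exclusion restrictions on the structure show up as nonlinear restrictions on the reduced form, which is the source of the efficiency gain.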

Just as in estimating the structural parameters, there is a robustness-efficiency
trade-off in estimating the $\pi_{gj}$. As mentioned earlier, the OLS estimators of each
reduced form are robust to misspecification of any restrictions on the structural
equations (although, as always, each element of $\mathbf{z}$ should be exogenous for OLS to be
consistent). The estimators of the $\pi_{gj}$ derived from estimators of $\boldsymbol{\Delta}$ and $\boldsymbol{\Gamma}$ (whether
the latter are 2SLS or system estimators) are generally nonrobust to incorrect
restrictions on the structural system. See Problem 9.11 for a simple illustration.


9.4 Additional Topics in Linear Simultaneous Equations Methods 


9.4.1 Using Cross Equation Restrictions to Achieve Identification 


So far we have discussed identification of a single equation using only within-equation 
parameter restrictions (see assumption (9.17)). This is by far the leading case, espe- 
cially when the system represents a simultaneous equations model with truly auton- 
omous equations. In fact, it is hard to think of a sensible example with true 
simultaneity where one would feel comfortable imposing restrictions across equa- 
tions. In supply and demand applications we typically do not think supply and demand 
parameters, which represent different sides of a market, would satisfy any known 
restrictions. In studying the relationship between college crime rates and arrest rates, 
any restrictions on the parameters across the two equations would be arbitrary. 
Nevertheless, there are examples where endogeneity is caused by omitted variables or 
measurement error where economic theory imposes cross equation restrictions. An 
example is a system of expenditure shares when total expenditures or prices are 
thought to be measured with error: the symmetry condition imposes cross equation 
restrictions. Not surprisingly, such cross equation restrictions are generally useful for 
identifying equations. A general treatment is beyond the scope of our analysis. Here 
we just give an example to show how identification and estimation work. 
Consider the two-equation system


$$y_1 = \gamma_{12} y_2 + \delta_{11} z_1 + \delta_{12} z_2 + \delta_{13} z_3 + u_1, \tag{9.31}$$
$$y_2 = \gamma_{21} y_1 + \delta_{21} z_1 + \delta_{22} z_2 + u_2, \tag{9.32}$$


where each $z_j$ is uncorrelated with $u_1$ and $u_2$ ($z_1$ can be unity to allow for an
intercept). Without further information, equation (9.31) is unidentified, and equation
(9.32) is just identified if and only if $\delta_{13} \neq 0$. We maintain these assumptions in what
follows.

Now suppose that $\delta_{12} = \delta_{22}$. Because $\delta_{22}$ is identified in equation (9.32) we can
treat it as known for studying identification of equation (9.31). But $\delta_{12} = \delta_{22}$, and so
we can write


$$y_1 - \delta_{12} z_2 = \gamma_{12} y_2 + \delta_{11} z_1 + \delta_{13} z_3 + u_1, \tag{9.33}$$


where $y_1 - \delta_{12} z_2$ is effectively known. Now the right-hand side of equation (9.33) has
one endogenous variable, $y_2$, and the two exogenous variables $z_1$ and $z_3$. Because $z_2$
is excluded from the right-hand side, we can use $z_2$ as an instrument for $y_2$, as long as
$z_2$ appears in the reduced form for $y_2$. This is the case provided $\delta_{12} = \delta_{22} \neq 0$.

This approach to showing that equation (9.31) is identified also suggests a consistent
estimation procedure: first, estimate equation (9.32) by 2SLS using $(z_1, z_2, z_3)$ as
instruments, and let $\hat{\delta}_{22}$ be the estimator of $\delta_{22}$. Then, estimate


$$y_1 - \hat{\delta}_{22} z_2 = \gamma_{12} y_2 + \delta_{11} z_1 + \delta_{13} z_3 + \mathit{error}$$


by 2SLS using $(z_1, z_2, z_3)$ as instruments. Since $\hat{\delta}_{22} \overset{p}{\to} \delta_{12}$ when $\delta_{12} = \delta_{22} \neq 0$, this last
step produces consistent estimators of $\gamma_{12}$, $\delta_{11}$, and $\delta_{13}$. Unfortunately, the usual 2SLS
standard errors obtained from the final estimation would not be valid because of the
preliminary estimation of $\delta_{22}$.
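
The two-step procedure can be sketched on simulated data. The data-generating values and the helper `two_sls` are hypothetical, and, as noted above, the second-step standard errors would need to be adjusted for the first-step estimation, so only point estimates are computed here.

```python
import numpy as np

def two_sls(y, X, Z):
    """2SLS: project X on the instruments Z, then regress y on the fitted values."""
    X_hat = Z @ np.linalg.lstsq(Z, X, rcond=None)[0]
    return np.linalg.lstsq(X_hat, y, rcond=None)[0]

rng = np.random.default_rng(2)
N = 100_000
z = np.column_stack([np.ones(N), rng.normal(size=(N, 2))])     # z1 = 1, z2, z3
u = rng.multivariate_normal([0, 0], [[1.0, 0.5], [0.5, 1.0]], size=N)

# Hypothetical parameter values satisfying the restriction delta12 = delta22.
g12, g21 = 0.5, -0.5
d11, d12, d13 = 1.0, 1.0, 1.0
d21, d22 = 0.5, 1.0

a1 = d11 + d12 * z[:, 1] + d13 * z[:, 2] + u[:, 0]
a2 = d21 + d22 * z[:, 1] + u[:, 1]
y1 = (a1 + g12 * a2) / (1 - g12 * g21)
y2 = g21 * y1 + a2

# Step 1: 2SLS on (9.32); z3 is the excluded instrument for y1.
b2 = two_sls(y2, np.column_stack([y1, z[:, 0], z[:, 1]]), z)
d22_hat = b2[2]

# Step 2: 2SLS on (9.33) with the estimated delta22 moved to the left-hand side;
# z2 now serves as the excluded instrument for y2.
b1 = two_sls(y1 - d22_hat * z[:, 1], np.column_stack([y2, z[:, 0], z[:, 2]]), z)
print("delta22_hat:", d22_hat)
print("(gamma12, delta11, delta13):", b1)
```

If $\delta_{12} = \delta_{22}$ were equal to zero in the simulation, $z_2$ would drop out of the reduced form for $y_2$ and the second step would have no usable instrument.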

It is easier to use a system procedure when cross equation restrictions are present
because the asymptotic variance can be obtained directly. We can always rewrite the
system in a linear form with the restrictions imposed. For this example, one way to
do so is to write the system as


$$\begin{pmatrix} y_1 \\ y_2 \end{pmatrix} = \begin{pmatrix} y_2 & z_1 & z_2 & z_3 & 0 & 0 \\ 0 & 0 & z_2 & 0 & y_1 & z_1 \end{pmatrix} \boldsymbol{\beta} + \begin{pmatrix} u_1 \\ u_2 \end{pmatrix}, \tag{9.34}$$


where $\boldsymbol{\beta} = (\gamma_{12}, \delta_{11}, \delta_{12}, \delta_{13}, \gamma_{21}, \delta_{21})'$. The parameter $\delta_{22}$ does not show up in $\boldsymbol{\beta}$
because we have imposed the restriction $\delta_{12} = \delta_{22}$ by appropriate choice of the matrix
of explanatory variables.

The matrix of instruments is $\mathbf{I}_2 \otimes \mathbf{z}$, meaning that we just use all exogenous variables
as instruments in each equation. Since $\mathbf{I}_2 \otimes \mathbf{z}$ has six columns, the order condition
is exactly satisfied (there are six elements of $\boldsymbol{\beta}$), and we have already seen when
the rank condition holds. The system can be consistently estimated using GMM or
3SLS.


9.4.2 Using Covariance Restrictions to Achieve Identification 


In most applications of linear SEMs, identification is obtained by putting restrictions
on the matrix of structural parameters $\mathbf{B}$. It is also possible to identify the elements of
$\mathbf{B}$ by placing restrictions on the variance matrix $\boldsymbol{\Sigma}$ of the structural errors. Usually the
restrictions come in the form of zero covariance assumptions. In microeconomic
applications of models with true simultaneity, it is difficult to think of examples of
autonomous systems of equations where it makes sense to assume the errors across
the different structural equations are uncorrelated. For example, in individual-level
labor supply and demand functions, what would be the justification for assuming that
unobserved factors affecting a person’s labor supply decisions are uncorrelated with
unobserved factors that make the person a more or less desirable worker? In a model
of city crime rates and spending on law enforcement, it hardly makes sense to assume
that unobserved city features that affect crime rates are uncorrelated with those
determining law enforcement expenditures. Nevertheless, there are applications
where zero covariance assumptions can make sense, and so we provide a brief
treatment here, using examples to illustrate the approach. General analyses of
identification with restrictions on $\boldsymbol{\Sigma}$ are given in Hausman (1983) and Hausman, Newey, and
Taylor (1987).
The first example is the two-equation system


$$y_1 = \gamma_{12} y_2 + \delta_{11} z_1 + \delta_{13} z_3 + u_1, \tag{9.35}$$
$$y_2 = \gamma_{21} y_1 + \delta_{21} z_1 + \delta_{22} z_2 + \delta_{23} z_3 + u_2. \tag{9.36}$$


Equation (9.35) is just identified if $\delta_{22} \neq 0$, which we assume, while equation (9.36) is
unidentified without more information. Suppose that we have one piece of additional
information in terms of a covariance restriction:


$$\operatorname{Cov}(u_1, u_2) = E(u_1 u_2) = 0. \tag{9.37}$$


In other words, if $\boldsymbol{\Sigma}$ is the $2 \times 2$ structural variance matrix, we are assuming that $\boldsymbol{\Sigma}$ is
diagonal. Assumption (9.37), along with $\delta_{22} \neq 0$, is enough to identify equation (9.36).

Here is a simple way to see how assumption (9.37) identifies equation (9.36). First,
because $\gamma_{12}$, $\delta_{11}$, and $\delta_{13}$ are identified, we can treat them as known when studying
identification of equation (9.36). But if the parameters in equation (9.35) are known,
$u_1$ is effectively known. By assumption (9.37), $u_1$ is uncorrelated with $u_2$, and $u_1$ is
certainly partially correlated with $y_1$. Thus, we effectively have $(z_1, z_2, z_3, u_1)$ as
instruments available for estimating equation (9.36), and this result shows that
equation (9.36) is identified.

We can use this method for verifying identification to obtain consistent estimators.
First, estimate equation (9.35) by 2SLS using instruments $(z_1, z_2, z_3)$ and save the
2SLS residuals, $\hat{u}_1$. Then estimate equation (9.36) by 2SLS using instruments
$(z_1, z_2, z_3, \hat{u}_1)$. The fact that $\hat{u}_1$ depends on estimates from a prior stage does not affect
consistency. But inference is complicated because of the estimation of $u_1$: condition
(6.8) does not hold because $u_1$ depends on $y_2$, which is correlated with $u_2$.

The most efficient way to use covariance restrictions is to write the entire set of
orthogonality conditions as $E[\mathbf{z}' u_1(\boldsymbol{\beta}_1)] = \mathbf{0}$, $E[\mathbf{z}' u_2(\boldsymbol{\beta}_2)] = \mathbf{0}$, and


$$E[u_1(\boldsymbol{\beta}_1) u_2(\boldsymbol{\beta}_2)] = 0, \tag{9.38}$$


where the notation $u_1(\boldsymbol{\beta}_1)$ emphasizes that the errors are functions of the structural
parameters $\boldsymbol{\beta}_1$ (with normalization and exclusion restrictions imposed), and similarly
for $u_2(\boldsymbol{\beta}_2)$. For example, from equation (9.35),
$u_1(\boldsymbol{\beta}_1) = y_1 - \gamma_{12} y_2 - \delta_{11} z_1 - \delta_{13} z_3$. Equation (9.38), because it is nonlinear in
$\boldsymbol{\beta}_1$ and $\boldsymbol{\beta}_2$, takes us outside the realm of linear moment restrictions. In Chapter 14
we will use nonlinear moment conditions in GMM estimation.

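A general example with covariance restrictions is a fully recursive system. First, a
recursive system can be written as


$$\begin{aligned}
y_1 &= \mathbf{z}\boldsymbol{\delta}_1 + u_1, \\
y_2 &= \gamma_{21} y_1 + \mathbf{z}\boldsymbol{\delta}_2 + u_2, \\
y_3 &= \gamma_{31} y_1 + \gamma_{32} y_2 + \mathbf{z}\boldsymbol{\delta}_3 + u_3, \\
&\;\;\vdots \\
y_G &= \gamma_{G1} y_1 + \cdots + \gamma_{G,G-1} y_{G-1} + \mathbf{z}\boldsymbol{\delta}_G + u_G,
\end{aligned} \tag{9.39}$$


so that in each equation, only endogenous variables from previous equations appear
on the right-hand side. We have allowed all exogenous variables to appear in each
equation, and we maintain assumption (9.2).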

The first equation in the system (9.39) is clearly identified and can be estimated by
OLS. Without further exclusion restrictions none of the remaining equations is
identified, but each is identified if we assume that the structural errors are pairwise
uncorrelated:


$$\operatorname{Cov}(u_g, u_h) = 0, \qquad g \neq h. \tag{9.40}$$


This assumption means that $\boldsymbol{\Sigma}$ is a $G \times G$ diagonal matrix. Equations (9.39) and
(9.40) define a fully recursive system. Under these assumptions, the right-hand-side
variables in equation $g$ are each uncorrelated with $u_g$; this fact is easily seen by
starting with the first equation and noting that $y_1$ is a linear function of $\mathbf{z}$ and $u_1$.
Then, in the second equation, $y_1$ is uncorrelated with $u_2$ under assumption (9.40). But
$y_2$ is a linear function of $\mathbf{z}$, $u_1$, and $u_2$, and so $y_2$ and $y_1$ are both uncorrelated with $u_3$
in the third equation. And so on. It follows that each equation in the system is
consistently estimated by ordinary least squares.
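
A small simulation illustrates the argument. The three-equation recursive system below and all of its coefficient values are hypothetical; the errors are drawn independently so that assumption (9.40) holds by construction.

```python
import numpy as np

rng = np.random.default_rng(3)
N = 100_000
z = np.column_stack([np.ones(N), rng.normal(size=N)])     # exogenous variables, z1 = 1
u = rng.normal(size=(N, 3))                                # pairwise uncorrelated errors

# Hypothetical fully recursive system with G = 3.
y1 = z @ np.array([1.0, 0.5]) + u[:, 0]
y2 = 0.8 * y1 + z @ np.array([-1.0, 1.0]) + u[:, 1]
y3 = -0.5 * y1 + 0.3 * y2 + z @ np.array([0.0, 2.0]) + u[:, 2]

# OLS on the third equation: y1 and y2 depend only on z, u1, and u2, all of
# which are uncorrelated with u3, so no instruments are needed.
X3 = np.column_stack([y1, y2, z])
b3 = np.linalg.lstsq(X3, y3, rcond=None)[0]
print(b3)    # estimates of (gamma31, gamma32, delta31, delta32)
```

Introducing correlation between the columns of `u` would break the argument: $y_2$ would then be correlated with $u_3$ through $u_2$, and OLS on the third equation would no longer be consistent.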

It turns out that OLS equation by equation is not necessarily the most efficient
estimator in fully recursive systems, even though $\boldsymbol{\Sigma}$ is a diagonal matrix. Generally,
efficiency can be improved by adding the zero covariance restrictions to the
orthogonality conditions, as in equation (9.38), and applying nonlinear GMM estimation.
See Lahiri and Schmidt (1978) and Hausman, Newey, and Taylor (1987).


9.4.3 Subtleties Concerning Identification and Efficiency in Linear Systems 


So far we have discussed identification and estimation under the assumption that
each exogenous variable appearing in the system, $z_j$, is uncorrelated with each
structural error, $u_g$. It is important to assume only zero correlation in the general
treatment because we often add a reduced form equation for an endogenous variable to a
structural system, and zero correlation is all we should impose in linear reduced
forms.

For entirely structural systems, it is often natural to assume that the structural
errors satisfy the zero conditional mean assumption


$$E(u_g \mid \mathbf{z}) = 0, \qquad g = 1, 2, \ldots, G. \tag{9.41}$$


In addition to giving the parameters in the structural equations the appropriate
partial effect interpretations, assumption (9.41) has some interesting statistical
implications: any function of $\mathbf{z}$ is uncorrelated with each error $u_g$. Therefore, in the labor
supply example (9.28), age$^2$, log(age), educ$\cdot$exper, and so on (there are too many
functions to list) are all uncorrelated with $u_1$ and $u_2$. Realizing this fact, we might ask,
why not use nonlinear functions of $\mathbf{z}$ as additional instruments in estimation?

We need to break the answer to this question into two parts. The first concerns 
identification and the second concerns efficiency. For identification, the bottom line is 
this: adding nonlinear functions of z to the instrument list cannot help with identifi- 
cation in linear systems. You were asked to show this generally in Problem 8.4, but 
the main points can be illustrated with a simple model:


$$y_1 = \gamma_{12} y_2 + \delta_{11} z_1 + \delta_{12} z_2 + u_1, \tag{9.42}$$
$$y_2 = \gamma_{21} y_1 + \delta_{21} z_1 + u_2, \tag{9.43}$$
$$E(u_1 \mid \mathbf{z}) = E(u_2 \mid \mathbf{z}) = 0. \tag{9.44}$$


From the order condition in Section 9.2.2, equation (9.42) is not identified, and
equation (9.43) is identified if and only if $\delta_{12} \neq 0$. Knowing properties of conditional
expectations, we might try something clever to identify equation (9.42): since, say, $z_1^2$
is uncorrelated with $u_1$ under assumption (9.41), and $z_1^2$ would appear to be
correlated with $y_2$, we can use it as an instrument for $y_2$ in equation (9.42). Under this
reasoning, we would have enough instruments ($z_1$, $z_2$, $z_1^2$) to identify equation (9.42).
In fact, any number of functions of $z_1$ and $z_2$ can be added to the instrument list.

The fact that this argument is faulty is fortunate because our identification analysis
in Section 9.2.2 says that equation (9.42) is not identified. In this example it is clear
that $z_1^2$ cannot appear in the reduced form for $y_2$ because $z_1^2$ appears nowhere in the
system. Technically, because $E(y_2 \mid \mathbf{z})$ is linear in $z_1$ and $z_2$ under assumption (9.44),
the linear projection of $y_2$ onto $(z_1, z_2, z_1^2)$ does not depend on $z_1^2$:

$$L(y_2 \mid z_1, z_2, z_1^2) = L(y_2 \mid z_1, z_2) = \pi_{21} z_1 + \pi_{22} z_2. \tag{9.45}$$

In other words, there is no partial correlation between $y_2$ and $z_1^2$ once $z_1$ and $z_2$ are
included in the projection.
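A small simulation can make equation (9.45) concrete. The sketch below is only illustrative: the parameter values, the normal distributions, and the sample size are assumptions chosen for convenience, not values from the text. It generates $y_2$ from the reduced form implied by (9.42) and (9.43) and then computes the sample analogue of the linear projection of $y_2$ on $(1, z_1, z_2, z_1^2)$.

```python
import numpy as np

# Illustrative structural parameters (assumed, not from the text).
g12, g21 = 0.5, 0.4              # gamma_12, gamma_21
d11, d12, d21 = 1.0, -1.0, 0.8   # delta_11, delta_12, delta_21

rng = np.random.default_rng(0)
n = 200_000
# Errors are drawn independently of z, so E(u1|z) = E(u2|z) = 0 as in (9.44).
z1, z2, u1, u2 = rng.normal(size=(4, n))

# Reduced form for y2 obtained by substituting (9.42) into (9.43).
denom = 1.0 - g12 * g21
pi21 = (g21 * d11 + d21) / denom
pi22 = g21 * d12 / denom
y2 = pi21 * z1 + pi22 * z2 + (g21 * u1 + u2) / denom

# Sample analogue of L(y2 | 1, z1, z2, z1^2).
X = np.column_stack([np.ones(n), z1, z2, z1**2])
coef = np.linalg.lstsq(X, y2, rcond=None)[0]
print("coefficients on (1, z1, z2, z1^2):", np.round(coef, 3))
print("population pi21, pi22:", pi21, pi22)
```

In a run of this kind the coefficient on $z_1^2$ should be close to zero up to sampling error, which is the sense in which $z_1^2$ has no partial correlation with $y_2$ and so cannot serve as an instrument in the linear system.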

The zero conditional mean assumptions (9.41) can have some relevance for
choosing an efficient estimator, although not always. If assumption (9.41) holds and
$\mathrm{Var}(\mathbf{u} \mid \mathbf{z}) = \mathrm{Var}(\mathbf{u}) = \boldsymbol{\Sigma}$, 3SLS using instruments $\mathbf{z}$ for each equation is the
asymptotically efficient estimator that uses the orthogonality conditions in assumption
(9.41); this conclusion follows from Theorem 8.5. In other words, if $\mathrm{Var}(\mathbf{u} \mid \mathbf{z})$ is
constant, it does not help to expand the instrument list beyond the functions of the
exogenous variables actually appearing in the system.

If assumption (9.41) holds but Var(u|z) is not constant, then we know from 
Chapter 8 that the GMM 3SLS estimator is not an efficient GMM estimator when 
the system is overidentified. In fact, under (9.41), without homoskedasticity there is 
no need to stop at instruments z, at least in theory. Why? As we discussed in Section 
8.6, one never does worse asymptotically by adding more valid instruments. Under 
(9.41), any functions of z, collected in the vector h(z), are uncorrelated with every 
element of u. Therefore, we can generally improve over the optimal GMM (minimum 
chi-square) estimator that uses IVs z by applying optimal GMM with instruments 
[z,h(z)]. This result was discovered independently by Hansen (1982) and White 
(1982b). Expanding the IV list to arbitrary functions of z and applying full GMM is 
not used very much in practice: it is usually not clear how to choose h(z), and, if we 
use too many additional instruments, the finite sample properties of the GMM esti- 
mator can be poor, as we discussed in Section 8.6. 

For SEMs linear in the parameters but nonlinear in endogenous variables (in a 
sense to be made precise), adding nonlinear functions of the exogenous variables to 
the instruments not only is desirable but is often needed to achieve identification. We 
turn to this topic next.

9.5 Simultaneous Equations Models Nonlinear in Endogenous Variables


We now study models that are nonlinear in some endogenous variables. While the 
general estimation methods we have covered are still applicable, identification and 
choice of instruments require special attention. 


9.5.1 Identification


The issues that arise in identifying models nonlinear in endogenous variables are 
most easily illustrated with a simple example. Suppose that supply and demand are 
given by 


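$$\log(q) = \gamma_{12} \log(p) + \gamma_{13} [\log(p)]^2 + \delta_{11} z_1 + u_1, \tag{9.46}$$
$$\log(q) = \gamma_{22} \log(p) + \delta_{22} z_2 + u_2, \tag{9.47}$$
$$E(u_1 \mid \mathbf{z}) = E(u_2 \mid \mathbf{z}) = 0, \tag{9.48}$$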


where the first equation is the supply equation, the second equation is the demand
equation, and the equilibrium condition that supply equals demand has been imposed.
For simplicity, we do not include an intercept in either equation, but no important
conclusions hinge on this omission. The exogenous variable $z_1$ shifts the supply
function but not the demand function; $z_2$ shifts the demand function but not the
supply function. The vector of exogenous variables appearing somewhere in the system
is $\mathbf{z} = (z_1, z_2)$.

It is important to understand why equations (9.46) and (9.47) constitute a "nonlinear"
system. This system is still linear in parameters, which is important because it
means that the IV procedures we have learned up to this point are still applicable.
Further, it is not the presence of the logarithmic transformations of $q$ and $p$ that
makes the system nonlinear. In fact, if we set $\gamma_{13} = 0$, then the model is linear for the
purposes of identification and estimation: defining $y_1 \equiv \log(q)$ and $y_2 \equiv \log(p)$, we
can write equations (9.46) and (9.47) as a standard two-equation system.

When we include $[\log(p)]^2$ we have the model

$$y_1 = \gamma_{12} y_2 + \gamma_{13} y_2^2 + \delta_{11} z_1 + u_1, \tag{9.49}$$
$$y_1 = \gamma_{22} y_2 + \delta_{22} z_2 + u_2. \tag{9.50}$$


With this system there is no way to define two endogenous variables such that the
system is a two-equation system linear in two endogenous variables. The presence of
$y_2^2$ in equation (9.49) makes this model different from those we have studied up until
now. We say that this is a system nonlinear in endogenous variables, and this entails a
different treatment of identification.

If we used equations (9.49) and (9.50) to obtain $y_2$ as a function of $z_1$, $z_2$, $u_1$, $u_2$,
and the parameters, the result would not be linear in $\mathbf{z}$ and $\mathbf{u}$. In this particular case
we can find the solution for $y_2$ using the quadratic formula (assuming a real solution
exists). However, $E(y_2 \mid \mathbf{z})$ would not be linear in $\mathbf{z}$ unless $\gamma_{13} = 0$, and $E(y_2^2 \mid \mathbf{z})$ would
not be linear in $\mathbf{z}$ regardless of the value of $\gamma_{13}$. These observations have important
implications for identification of equation (9.49) and for choosing instruments.
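To see what "using the quadratic formula" involves, equate the right-hand sides of (9.49) and (9.50). This gives $\gamma_{13} y_2^2 + (\gamma_{12} - \gamma_{22}) y_2 + (\delta_{11} z_1 + u_1 - \delta_{22} z_2 - u_2) = 0$. The following sketch solves this equation for given values; the function name and the numerical inputs are hypothetical and serve only as an illustration.

```python
import math

def solve_log_price(g12, g13, g22, d11, d22, z1, z2, u1, u2):
    """Real solutions y2 of (9.49) = (9.50), i.e. a*y2**2 + b*y2 + c = 0."""
    a = g13
    b = g12 - g22
    c = d11 * z1 + u1 - d22 * z2 - u2
    if a == 0.0:
        if b == 0.0:
            raise ValueError("supply and demand have the same slope: no unique solution")
        return [-c / b]          # linear system: y2 is linear in (z, u)
    disc = b * b - 4.0 * a * c
    if disc < 0.0:
        return []                # no real equilibrium for these values
    root = math.sqrt(disc)
    return [(-b + root) / (2.0 * a), (-b - root) / (2.0 * a)]

# Hypothetical values, chosen only so that a real solution exists.
roots = solve_log_price(g12=1.0, g13=0.2, g22=-1.0, d11=0.5, d22=0.7,
                        z1=1.0, z2=2.0, u1=0.1, u2=-0.3)
linear_root = solve_log_price(g12=1.0, g13=0.0, g22=-1.0, d11=0.5, d22=0.7,
                              z1=1.0, z2=2.0, u1=0.1, u2=-0.3)
print(roots, linear_root)
```

The square root of an expression involving $\mathbf{z}$ and $\mathbf{u}$ is what prevents $y_2$ from being linear in $\mathbf{z}$ and $\mathbf{u}$ when $\gamma_{13} \neq 0$; with $\gamma_{13} = 0$ the solution collapses to the usual linear reduced form.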

Before considering equations (9.49) and (9.50) further, consider a second example 
where closed form expressions for the endogenous variables in terms of the exoge- 
nous variables and structural errors do not even exist. Suppose that a system de- 
scribing crime rates in terms of law enforcement spending is 


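$$\mathit{crime} = \gamma_{12} \log(\mathit{spending}) + \mathbf{z}_{(1)} \boldsymbol{\delta}_{(1)} + u_1, \tag{9.51}$$
$$\mathit{spending} = \gamma_{21}\, \mathit{crime} + \gamma_{22}\, \mathit{crime}^2 + \mathbf{z}_{(2)} \boldsymbol{\delta}_{(2)} + u_2, \tag{9.52}$$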


where the errors have zero mean given $\mathbf{z}$. Here, we cannot solve for either crime
or spending (or any other transformation of them) in terms of $\mathbf{z}$, $u_1$, $u_2$, and the
parameters. And there is no way to define $y_1$ and $y_2$ to yield a linear SEM in two
endogenous variables. The model is still linear in parameters, but $E(\mathit{crime} \mid \mathbf{z})$,
$E[\log(\mathit{spending}) \mid \mathbf{z}]$, and $E(\mathit{spending} \mid \mathbf{z})$ are not linear in $\mathbf{z}$ (nor can we find closed
forms for these expectations).

One possible approach to identification in nonlinear SEMs is to ignore the fact that
the same endogenous variables show up differently in different equations. In the supply
and demand example, define $y_3 \equiv y_2^2$ and rewrite equation (9.49) as

$$y_1 = \gamma_{12} y_2 + \gamma_{13} y_3 + \delta_{11} z_1 + u_1. \tag{9.53}$$

Or, in equations (9.51) and (9.52) define $y_1 = \mathit{crime}$, $y_2 = \mathit{spending}$,
$y_3 = \log(\mathit{spending})$, and $y_4 = \mathit{crime}^2$, and write

$$y_1 = \gamma_{12} y_3 + \mathbf{z}_{(1)} \boldsymbol{\delta}_{(1)} + u_1, \tag{9.54}$$
$$y_2 = \gamma_{21} y_1 + \gamma_{22} y_4 + \mathbf{z}_{(2)} \boldsymbol{\delta}_{(2)} + u_2. \tag{9.55}$$


Defining nonlinear functions of endogenous variables as new endogenous variables 
turns out to work fairly generally, provided we apply the rank and order conditions 
properly. The key question is, what kinds of equations do we add to the system for 
the newly defined endogenous variables? 

If we add linear projections of the newly defined endogenous variables in terms of
the original exogenous variables appearing somewhere in the system (that is, the
linear projection onto $\mathbf{z}$), then we are being much too restrictive. For example,
suppose to equations (9.53) and (9.50) we add the linear equation

$$y_3 = \pi_{31} z_1 + \pi_{32} z_2 + v_3, \tag{9.56}$$

where, by definition, $E(z_1 v_3) = E(z_2 v_3) = 0$. With equation (9.56) to round out the
system, the order condition for identification of equation (9.53) clearly fails: we have
two endogenous variables in equation (9.53) but only one excluded exogenous variable,
$z_2$.

The conclusion that equation (9.53) is not identified is too pessimistic. There are
many other possible instruments available for $y_2^2$. Because $E(y_2^2 \mid \mathbf{z})$ is not linear in $z_1$
and $z_2$ (even if $\gamma_{13} = 0$), other functions of $z_1$ and $z_2$ will appear in a linear projection
involving $y_2^2$ as the dependent variable. To see what the most useful of these are likely
to be, suppose that the structural system actually is linear, so that $\gamma_{13} = 0$. Then
$y_2 = \pi_{21} z_1 + \pi_{22} z_2 + v_2$, where $v_2$ is a linear combination of $u_1$ and $u_2$. Squaring this
reduced form and using $E(v_2 \mid \mathbf{z}) = 0$ gives

$$E(y_2^2 \mid \mathbf{z}) = \pi_{21}^2 z_1^2 + \pi_{22}^2 z_2^2 + 2\pi_{21}\pi_{22} z_1 z_2 + E(v_2^2 \mid \mathbf{z}). \tag{9.57}$$

If $E(v_2^2 \mid \mathbf{z})$ is constant, an assumption that holds under homoskedasticity of the
structural errors, then equation (9.57) shows that $y_2^2$ is correlated with $z_1^2$, $z_2^2$, and
$z_1 z_2$, which makes these functions natural instruments for $y_2^2$. The only case where no
functions of $\mathbf{z}$ are correlated with $y_2^2$ occurs when both $\pi_{21}$ and $\pi_{22}$ equal zero, in
which case the linear version of equation (9.49) (with $\gamma_{13} = 0$) is also unidentified.
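Equation (9.57) is easy to check numerically. The sketch below assumes a linear reduced form with a homoskedastic error of unit variance and independent standard normal exogenous variables; the coefficient values are arbitrary illustrations. Regressing $y_2^2$ on a constant, $z_1^2$, $z_2^2$, and $z_1 z_2$ should then return approximately $E(v_2^2)$, $\pi_{21}^2$, $\pi_{22}^2$, and $2\pi_{21}\pi_{22}$.

```python
import numpy as np

pi21, pi22 = 1.5, -0.5        # illustrative reduced-form coefficients (assumed)

rng = np.random.default_rng(1)
n = 200_000
z1, z2, v2 = rng.normal(size=(3, n))   # v2 is homoskedastic with Var(v2) = 1
y2 = pi21 * z1 + pi22 * z2 + v2

# Regress y2^2 on the functions of z that appear in (9.57).
X = np.column_stack([np.ones(n), z1**2, z2**2, z1 * z2])
coef = np.linalg.lstsq(X, y2**2, rcond=None)[0]

# Values implied by (9.57) under these assumptions.
target = np.array([1.0, pi21**2, pi22**2, 2.0 * pi21 * pi22])
print("estimated:", np.round(coef, 3))
print("implied  :", target)
```

The nonzero coefficients on the squares and the cross product are what make these functions usable as instruments for $y_2^2$ in this design.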

Because we derived equation (9.57) under the restrictive assumptions $\gamma_{13} = 0$ and
homoskedasticity of $v_2$, we would not want our linear projection for $y_2^2$ to omit the
exogenous variables that originally appear in the system. (Plus, for simplicity, we
omitted intercepts from the equations.) In practice, we would augment equations
(9.53) and (9.50) with the linear projection

$$y_3 = \pi_{31} z_1 + \pi_{32} z_2 + \pi_{33} z_1^2 + \pi_{34} z_2^2 + \pi_{35} z_1 z_2 + v_3, \tag{9.58}$$

where $v_3$ is, by definition, uncorrelated with $z_1$, $z_2$, $z_1^2$, $z_2^2$, and $z_1 z_2$. The system (9.53),
(9.50), and (9.58) can now be studied using the usual rank condition.

Adding equation (9.58) to the original system and then studying the rank condition
of the first two equations is equivalent to studying the rank condition in the smaller
system (9.53) and (9.50). What we mean by this statement is that we do not explicitly
add an equation for $y_3 = y_2^2$, but we do include $y_3$ in equation (9.53). Therefore,
when applying the rank condition to equation (9.53), we use $G = 2$ (not $G = 3$). The
reason this approach is the same as studying the rank condition in the three-equation
system (9.53), (9.50), and (9.58) is that adding the third equation increases the rank of
$\mathbf{R}_1\mathbf{B}$ by one whenever at least one additional nonlinear function of $\mathbf{z}$ appears in
equation (9.58). (The functions $z_1^2$, $z_2^2$, and $z_1 z_2$ appear nowhere else in the system.)

As a general approach to identification in models where the nonlinear functions of 
the endogenous variables depend only on a single endogenous variable—such as the 
two examples that we have already covered—Fisher (1965) argues that the following 
method is sufficient for identification: 


1. Relabel the nonredundant functions of the endogenous variables to be new 
endogenous variables, as in equation (9.53) or in equations (9.54) and (9.55). 


2. Apply the rank condition to the original system without increasing the number of 
equations. If the equation of interest satisfies the rank condition, then it is identified. 


The proof that this method works is complicated, and it requires more assumptions 
than we have made (such as u being independent of z). Intuitively, we can expect each 
additional nonlinear function of the endogenous variables to have a linear projection 
that depends on new functions of the exogenous variables. Each time we add another 
function of an endogenous variable, it effectively comes with its own instruments. 

Fisher's method can be expected to work in all but the most pathological cases.
One case where it does not work is if $E(v_2^2 \mid \mathbf{z})$ in equation (9.57) is heteroskedastic
in such a way as to cancel out the squares and cross product terms in $z_1$ and $z_2$; then
$E(y_2^2 \mid \mathbf{z})$ would be constant. Such unfortunate coincidences are not practically
important.

It is tempting to think that Fisher's rank condition is also necessary for identification,
but this is not the case. To see why, consider the two-equation system

$$y_1 = \gamma_{12} y_2 + \gamma_{13} y_2^2 + \delta_{11} z_1 + \delta_{12} z_2 + u_1, \tag{9.59}$$
$$y_2 = \gamma_{21} y_1 + \delta_{21} z_1 + u_2. \tag{9.60}$$

The first equation clearly fails the modified rank condition because it fails the order
condition: there are no restrictions on the first equation except the normalization
restriction. However, if $\gamma_{13} \neq 0$ and $\gamma_{21} \neq 0$, then $E(y_2 \mid \mathbf{z})$ is a nonlinear function of $\mathbf{z}$
(which we cannot obtain in closed form). The result is that functions such as $z_1^2$, $z_2^2$,
and $z_1 z_2$ (and others) will appear in the linear projections of $y_2$ and $y_2^2$ even after $z_1$
and $z_2$ have been included, and these can then be used as instruments for $y_2$ and $y_2^2$.
But if $\gamma_{13} = 0$, the first equation cannot be identified by adding nonlinear functions of
$z_1$ and $z_2$ to the instrument list: the linear projection of $y_2$ on $z_1$, $z_2$, and any function
of $(z_1, z_2)$ will only depend on $z_1$ and $z_2$.

Equation (9.59) is an example of a poorly identified model because, when it is
identified, it is identified due to a nonlinearity ($\gamma_{13} \neq 0$ in this case). Such identification
is especially tenuous because the hypothesis $H_0\colon \gamma_{13} = 0$ cannot be tested by estimating
the structural equation (since the structural equation is not identified when $H_0$ holds).

There are other models where identification can be verified using reasoning similar
to that used in the labor supply example. Models with interactions between exogenous
variables and endogenous variables can be shown to be identified when the
model without the interactions is identified (see Example 6.2 and Problem 9.6).
Models with interactions among endogenous variables are also fairly easy to handle.
Generally, it is good practice to check whether the most general linear version of the
model would be identified. If it is, then the nonlinear version of the model is probably
identified. We saw this result in equation (9.46): if this equation is identified when
$\gamma_{13} = 0$, then it is identified for any value of $\gamma_{13}$. If the most general linear version of a
nonlinear model is not identified, we should be very wary about proceeding, since
identification hinges on the presence of nonlinearities that we usually will not be able
to test.


9.5.2 Estimation 


In practice, it is difficult to know which additional functions we should add to the
instrument list for nonlinear SEMs. Naturally, we must always include the exogenous
variables appearing somewhere in the system as instruments in every equation. After
that, the choice is somewhat arbitrary, although the functional forms appearing in
the structural equations can be helpful.

A general approach is to always use some squares and cross products of the exogenous
variables appearing somewhere in the system. If $\mathit{exper}$ and $\mathit{exper}^2$ appear in the
system, additional terms such as $\mathit{exper}^3$ and $\mathit{exper}^4$ are natural additions to the
instrument list.
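Building such an instrument list is mechanical. The helper below is a minimal sketch: the function name and the column labels are hypothetical, and it assumes the input matrix holds the non-constant exogenous variables. It appends every square and pairwise cross product to the original columns.

```python
from itertools import combinations_with_replacement

import numpy as np

def add_quadratic_terms(Z, names):
    """Return Z augmented with all squares and pairwise cross products of its columns."""
    cols, labels = [Z], list(names)
    for i, j in combinations_with_replacement(range(Z.shape[1]), 2):
        cols.append((Z[:, i] * Z[:, j]).reshape(-1, 1))
        labels.append(f"{names[i]}^2" if i == j else f"{names[i]}*{names[j]}")
    return np.hstack(cols), labels

# Hypothetical data with two exogenous variables.
rng = np.random.default_rng(0)
Z = rng.normal(size=(5, 2))
Z_aug, labels = add_quadratic_terms(Z, ["z1", "z2"])
print(labels)        # ['z1', 'z2', 'z1^2', 'z1*z2', 'z2^2']
print(Z_aug.shape)   # (5, 5)
```

In applications one would usually drop redundant columns before estimation (for example, the square of a binary variable duplicates the variable itself) and keep the list short, since only some squares and cross products are needed.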

Once we decide on a set of instruments, any equation in a nonlinear SEM can be 
estimated by 2SLS. Because each equation satisfies the assumptions of single-equation 
analysis, we can use everything we have learned up to now for inference and specifi- 
cation testing for 2SLS. A system method can also be used, where linear projections 
for the functions of endogenous variables are explicitly added to the system. Then, all 
exogenous variables included in these linear projections can be used as the instru- 
ments for every equation. The minimum chi-square estimator is generally more ap- 
propriate than 3SLS because the homoskedasticity assumption will rarely be satisfied 
in the linear projections. 

It is important to apply the instrumental variables procedures directly to the 
structural equation or equations. In other words, we should directly use the formulas 
for 2SLS, 3SLS, or GMM. Trying to mimic 2SLS or 3SLS by substituting fitted 
values for some of the endogenous variables inside the nonlinear functions is usually
a mistake: neither the conditional expectation nor the linear projection operator
passes through nonlinear functions, and so such attempts rarely produce consistent 
estimators in nonlinear systems. 


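Example 9.6 (Nonlinear Labor Supply Function): We add $[\log(\mathit{wage})]^2$ to the labor
supply function in Example 9.5:

$$\begin{aligned} \mathit{hours} = {} & \gamma_{12} \log(\mathit{wage}) + \gamma_{13} [\log(\mathit{wage})]^2 + \delta_{10} + \delta_{11}\,\mathit{educ} + \delta_{12}\,\mathit{age} \\ & + \delta_{13}\,\mathit{kidslt6} + \delta_{14}\,\mathit{kidsge6} + \delta_{15}\,\mathit{nwifeinc} + u_1, \end{aligned} \tag{9.61}$$
$$\log(\mathit{wage}) = \delta_{20} + \delta_{21}\,\mathit{educ} + \delta_{22}\,\mathit{exper} + \delta_{23}\,\mathit{exper}^2 + u_2, \tag{9.62}$$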


where we have dropped hours from the wage offer function because it was insignificant
in Example 9.5. The natural assumptions in this system are
$E(u_1 \mid \mathbf{z}) = E(u_2 \mid \mathbf{z}) = 0$, where $\mathbf{z}$ contains all variables other than hours and log(wage).

There are many possibilities as additional instruments for $[\log(\mathit{wage})]^2$. Here, we
add three quadratic terms to the list ($\mathit{age}^2$, $\mathit{educ}^2$, and $\mathit{nwifeinc}^2$) and we estimate
equation (9.61) by 2SLS. We obtain $\hat{\gamma}_{12} = 1{,}873.62$ (se = 635.99) and $\hat{\gamma}_{13} = -437.29$
(se = 350.08). The $t$ statistic on $[\log(\mathit{wage})]^2$ is about $-1.25$, so we would be justified
in dropping it from the labor supply function. Regressing the 2SLS residuals $\hat{u}_1$ on all
variables used as instruments in the supply equation gives R-squared = .0061, and so
the $N \cdot$R-squared statistic is 2.61. With a $\chi_3^2$ distribution this gives p-value = .456.
Thus, we fail to reject the overidentifying restrictions.
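The reported p-value can be reproduced from the statistic alone. The sketch below takes the $N \cdot$R-squared value of 2.61 from the example and assumes three degrees of freedom, a count based on five instruments excluded from equation (9.61) ($\mathit{exper}$, $\mathit{exper}^2$, $\mathit{age}^2$, $\mathit{educ}^2$, $\mathit{nwifeinc}^2$) less two endogenous explanatory variables. For three degrees of freedom the chi-square upper tail has a closed form, so only the standard library is needed.

```python
import math

def chi2_sf_3df(x):
    """Upper-tail probability P(X > x) for a chi-square variable with 3 degrees of freedom."""
    return math.erfc(math.sqrt(x / 2.0)) + math.sqrt(2.0 * x / math.pi) * math.exp(-x / 2.0)

overid_stat = 2.61            # N * R-squared from the residual regression in Example 9.6
p_value = chi2_sf_3df(overid_stat)
print(round(p_value, 3))      # about 0.456
```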


In the previous example we may be tempted to estimate the labor supply function 
using a two-step procedure that appears to mimic 2SLS: 


1. Regress log(wage) on all exogenous variables appearing in equations (9.61) and
(9.62) and obtain the predicted values. For emphasis, call these $\hat{y}_2$.


2. Estimate the labor supply function from the OLS regression hours on 1, $\hat{y}_2$, $(\hat{y}_2)^2$,
educ, ..., nwifeinc.


This two-step procedure is not the same as estimating equation (9.61) by 2SLS,
and, except in special circumstances, it does not produce consistent estimators of the
structural parameters. The regression in step 2 is an example of what is sometimes
called a forbidden regression, a phrase that describes replacing a nonlinear function of
an endogenous explanatory variable with the same nonlinear function of fitted values
from a first-stage estimation. In plugging fitted values into equation (9.61), our mistake
is in thinking that the linear projection of the square is the square of the linear
projection. What the 2SLS estimator does in the first stage is project each of $y_2$ and
$y_2^2$ onto the original exogenous variables and the additional nonlinear functions of
these that we have chosen. The fitted values from the reduced form regression for $y_2^2$,
say $\hat{y}_3$, are not the same as the squared fitted values from the reduced form regression
for $y_2$, $(\hat{y}_2)^2$. This distinction is the difference between a consistent estimator and an
inconsistent estimator.

If we apply the forbidden regression to equation (9.61), some of the estimates are
very different from the 2SLS estimates. For example, the coefficient on educ, when
equation (9.61) is properly estimated by 2SLS, is about $-87.85$ with a $t$ statistic of
$-1.32$. The forbidden regression gives a coefficient on educ of about $-176.68$ with a $t$
statistic of $-5.36$. Unfortunately, the $t$ statistic from the forbidden regression is generally
invalid, even asymptotically. (The forbidden regression will produce consistent
estimators in the special case $\gamma_{13} = 0$ if $E(u_2 \mid \mathbf{z}) = 0$; see Problem 9.12.)

Many more functions of the exogenous variables could be added to the instrument
list in estimating the labor supply function. From Chapter 8, we know that efficiency
of GMM never falls by adding more nonlinear functions of the exogenous variables
to the instrument list (even under the homoskedasticity assumption). This statement
is true whether we use a single-equation or system method. Unfortunately, the fact
that we do no worse asymptotically by adding instruments is of limited practical help,
since we do not want to use too many instruments for a given data set. In Example
9.6, rather than using a long list of additional nonlinear functions, we might use $(\hat{y}_2)^2$
as a single IV for $y_2^2$. (This method is not the same as the forbidden regression!) If it
happens that $\gamma_{13} = 0$ and the structural errors are homoskedastic, this would be the
optimal IV. (See Problem 9.12.)
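The difference between 2SLS, the forbidden regression, and the use of $(\hat{y}_2)^2$ as an instrument can be seen in a simulation. The sketch below does not use the labor supply data; it assumes a simple triangular design with made-up parameter values, in which the reduced-form error for $y_2$ is heteroskedastic. That choice is deliberate: in this particular design it is the heteroskedasticity that makes the plug-in error correlated with the squared fitted values. The helper names `ols` and `tsls` are ours.

```python
import numpy as np

def ols(X, y):
    """OLS coefficients from regressing y (vector or matrix) on X."""
    return np.linalg.lstsq(X, y, rcond=None)[0]

def tsls(X, Z, y):
    """2SLS: project the regressors X on the instruments Z, then solve X_hat'X b = X_hat'y."""
    X_hat = Z @ ols(Z, X)
    return np.linalg.solve(X_hat.T @ X, X_hat.T @ y)

# Assumed design: y1 = g12*y2 + g13*y2^2 + d11*z1 + u1, with y2 = z1 + z2 + v2.
g12, g13, d11 = 0.5, 1.0, 1.0
rng = np.random.default_rng(123)
n = 200_000
z1, z2, e, eps = rng.normal(size=(4, n))
v2 = np.sqrt(0.5 + z2**2) * e        # E(v2|z) = 0 but Var(v2|z) depends on z2
u1 = 0.8 * e + 0.6 * eps             # correlated with v2, so y2 is endogenous
y2 = z1 + z2 + v2
y1 = g12 * y2 + g13 * y2**2 + d11 * z1 + u1

one = np.ones(n)
X = np.column_stack([one, y2, y2**2, z1])               # structural regressors
Z_lin = np.column_stack([one, z1, z2])                  # original exogenous variables
Z_quad = np.column_stack([Z_lin, z1**2, z2**2, z1 * z2])

# (a) Proper 2SLS with squares and the cross product added to the instrument list.
b_2sls = tsls(X, Z_quad, y1)

# (b) Forbidden regression: plug the fitted values and their square into an OLS regression.
y2_hat = Z_lin @ ols(Z_lin, y2)
b_forbidden = ols(np.column_stack([one, y2_hat, y2_hat**2, z1]), y1)

# (c) IV using the squared fitted value as the single extra instrument for y2^2.
b_iv = tsls(X, np.column_stack([Z_lin, y2_hat**2]), y1)

print("true gamma_13      :", g13)
print("2SLS               :", round(b_2sls[2], 3))
print("forbidden          :", round(b_forbidden[2], 3))
print("IV with (y2_hat)^2 :", round(b_iv[2], 3))
```

Under these assumptions the two IV estimators of $\gamma_{13}$ should be close to the true value, while the forbidden regression settles on a different number even in a very large sample. The only difference between (b) and (c) is whether $(\hat{y}_2)^2$ is used as a regressor or as an instrument.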

A general system linear in parameters can be written as

$$\begin{aligned} y_1 &= \mathbf{q}_1(\mathbf{y}, \mathbf{z}) \boldsymbol{\beta}_1 + u_1 \\ &\;\;\vdots \\ y_G &= \mathbf{q}_G(\mathbf{y}, \mathbf{z}) \boldsymbol{\beta}_G + u_G, \end{aligned} \tag{9.63}$$

where $E(u_g \mid \mathbf{z}) = 0$, $g = 1, 2, \ldots, G$. Among other things this system allows for
complicated interactions among endogenous and exogenous variables. We will not give a
general analysis of such systems because identification and choice of instruments are
too abstract to be very useful. Either single-equation or system methods can be used
for estimation.


9.5.3 Control Function Estimation for Triangular Systems 


A triangular system of equations is similar to a recursive system, defined in Section 
9.4.2, except that we now allow several endogenous variables to be determined by
only exogenous variables (and errors) and then assume those appear in a set of
equations for another set of endogenous variables. Allowing for nonlinear functions 
of endogenous and exogenous variables, we can write 


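$$\mathbf{y}_1 = \mathbf{F}_1(\mathbf{y}_2, \mathbf{z}) \boldsymbol{\beta}_1 + \mathbf{u}_1, \tag{9.64}$$
$$\mathbf{y}_2 = \mathbf{F}_2(\mathbf{z}) \boldsymbol{\beta}_2 + \mathbf{u}_2, \tag{9.65}$$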


where $\mathbf{y}_1$ is $G_1 \times 1$, $\mathbf{y}_2$ is $G_2 \times 1$, and the functions $\mathbf{F}_1(\cdot)$ and $\mathbf{F}_2(\cdot)$ are known matrix
functions. We maintain the exogeneity assumptions

$$E(\mathbf{u}_1 \mid \mathbf{z}) = \mathbf{0}, \qquad E(\mathbf{u}_2 \mid \mathbf{z}) = \mathbf{0}, \tag{9.66}$$

without which nonlinear systems are rarely identified. Typically our interest is in
(9.64). In fact, (9.65) is often a set of reduced form equations; in any case,
$E(\mathbf{y}_2 \mid \mathbf{z}) = \mathbf{F}_2(\mathbf{z})\boldsymbol{\beta}_2$. A potentially important point is that a nonlinear structural model,
where (9.64) is augmented with an equation where $\mathbf{y}_2$ is a function of $\mathbf{y}_1$, could rarely
be solved to produce (9.65) with known $\mathbf{F}_2(\cdot)$ and an additive error with zero conditional
mean. Nevertheless, we might simply think of (9.65), with $\mathbf{F}_2(\cdot)$ sufficiently
flexible, as a way to approximate $E(\mathbf{y}_2 \mid \mathbf{z})$.

Without further assumptions, we can estimate (9.64) and (9.65) by GMM methods,
provided (9.64) is identified. (Remember, unless $\mathbf{F}_2(\mathbf{z})$ contains exact linear dependencies,
(9.65) would always be identified.) Given $\mathbf{F}_1(\cdot)$ and $\mathbf{F}_2(\cdot)$, instruments might
suggest themselves, or we might use polynomials of low orders. Here, we discuss a
control function approach under an additional assumption, a special case of which
we saw in Section 6.2:

$$E(\mathbf{u}_1 \mid \mathbf{u}_2, \mathbf{z}) = E(\mathbf{u}_1 \mid \mathbf{u}_2). \tag{9.67}$$

A sufficient condition for (9.67) is that $(\mathbf{u}_1, \mathbf{u}_2)$ is independent of $\mathbf{z}$, but independence
is a pretty strong assumption, especially if (9.65) is simply supposed to be a way to
specify a model for $E(\mathbf{y}_2 \mid \mathbf{z})$. Assuming that (9.65) holds with $\mathbf{u}_2$ independent of $\mathbf{z}$
effectively rules out discreteness in the elements of $\mathbf{y}_2$, as will become clear in Part IV
of the text. But sometimes (9.67) is reasonable.
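The power of (9.67) is that we can write

$$\begin{aligned} E(\mathbf{y}_1 \mid \mathbf{y}_2, \mathbf{z}) = E(\mathbf{y}_1 \mid \mathbf{u}_2, \mathbf{z}) &= \mathbf{F}_1(\mathbf{y}_2, \mathbf{z})\boldsymbol{\beta}_1 + E(\mathbf{u}_1 \mid \mathbf{u}_2, \mathbf{z}) \\ &= \mathbf{F}_1(\mathbf{y}_2, \mathbf{z})\boldsymbol{\beta}_1 + E(\mathbf{u}_1 \mid \mathbf{u}_2) \equiv \mathbf{F}_1(\mathbf{y}_2, \mathbf{z})\boldsymbol{\beta}_1 + \mathbf{H}_1(\mathbf{u}_2). \end{aligned} \tag{9.68}$$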


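Therefore, if we knew $\mathbf{H}_1(\cdot)$ and could observe $\mathbf{u}_2$, we could estimate $\boldsymbol{\beta}_1$ by a system
estimation method, perhaps system OLS or FGLS. We might be willing to assume
$\mathbf{H}_1(\cdot)$ is linear, so $E(\mathbf{u}_1 \mid \mathbf{u}_2) = \boldsymbol{\Lambda}_1 \mathbf{u}_2 = (\mathbf{u}_2' \otimes \mathbf{I}_{G_1}) \operatorname{vec}(\boldsymbol{\Lambda}_1) \equiv \mathbf{U}_2 \boldsymbol{\lambda}_1$. Then, in a first
stage, we can estimate (9.65) by a standard system method and obtain $\hat{\boldsymbol{\beta}}_2$ and the
residuals, $\hat{\mathbf{u}}_{i2}$. Next, form $\hat{\mathbf{U}}_{i2} = (\hat{\mathbf{u}}_{i2}' \otimes \mathbf{I}_{G_1})$ and estimate

$$\mathbf{y}_{i1} = \mathbf{F}_{i1} \boldsymbol{\beta}_1 + \hat{\mathbf{U}}_{i2} \boldsymbol{\lambda}_1 + \mathit{error}_i \tag{9.69}$$

using a system estimation method for exogenous variables (OLS or FGLS), where
$\mathbf{F}_{i1} = \mathbf{F}_1(\mathbf{y}_{i2}, \mathbf{z}_i)$ and $\mathit{error}_i$ includes the estimation error from having to estimate $\boldsymbol{\beta}_2$
in the first stage. As in Chapter 6, the asymptotic variance matrix should be adjusted
for the two-step estimation. Alternatively, we will show how to use a more general
GMM setup in Chapter 14 to obtain the asymptotic variance from a larger GMM
problem.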
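The step from $\boldsymbol{\Lambda}_1 \mathbf{u}_2$ to $(\mathbf{u}_2' \otimes \mathbf{I}_{G_1}) \operatorname{vec}(\boldsymbol{\Lambda}_1)$ is a standard Kronecker product identity, and it is what turns the control function term into a set of regressors with a coefficient vector $\boldsymbol{\lambda}_1$. A quick numerical check, with dimensions and random values chosen arbitrarily for illustration:

```python
import numpy as np

rng = np.random.default_rng(1)
G1, G2 = 2, 3                       # illustrative dimensions of y1 and y2
Lam = rng.normal(size=(G1, G2))     # plays the role of Lambda_1 (G1 x G2)
u2 = rng.normal(size=G2)            # one draw of the G2 x 1 error vector

lhs = Lam @ u2                                   # Lambda_1 u_2
U2 = np.kron(u2.reshape(1, -1), np.eye(G1))      # (u_2' kron I_G1), shape G1 x (G1*G2)
lam = Lam.reshape(-1, order="F")                 # vec(.) stacks the columns of Lambda_1
rhs = U2 @ lam

print(np.allclose(lhs, rhs))
```

Note that `order="F"` is needed because the vec operator stacks columns, whereas NumPy flattens by rows by default.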

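As a specific example, suppose we are interested in the single equation

$$y_1 = \mathbf{z}_1 \boldsymbol{\delta}_1 + \alpha_{11} y_2 + \alpha_{12} y_3 + \alpha_{13} y_2 y_3 + y_2 \mathbf{z}_1 \boldsymbol{\gamma}_{11} + y_3 \mathbf{z}_1 \boldsymbol{\gamma}_{12} + u_1, \tag{9.70}$$

which allows for interactive effects among endogenous variables and between each
endogenous variable and the exogenous variables in $\mathbf{z}_1$. We assume that $y_2$ and $y_3$ have
reduced forms

$$y_2 = \mathbf{z} \boldsymbol{\beta}_2 + u_2, \qquad y_3 = \mathbf{z} \boldsymbol{\beta}_3 + u_3, \tag{9.71}$$

where all errors have zero conditional means. (The vector $\mathbf{z}$ might include nonlinear
functions of the exogenous variables.) If, in addition,

$$E(u_1 \mid \mathbf{z}, u_2, u_3) = \lambda_{12} u_2 + \lambda_{13} u_3, \tag{9.72}$$

then the OLS regression

$$y_{i1} \text{ on } \mathbf{z}_{i1},\; y_{i2},\; y_{i3},\; y_{i2} y_{i3},\; y_{i2} \mathbf{z}_{i1},\; y_{i3} \mathbf{z}_{i1},\; \hat{u}_{i2},\; \hat{u}_{i3}, \qquad i = 1, \ldots, N \tag{9.73}$$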


consistently estimates the parameters, where $\hat{u}_{i2}$ and $\hat{u}_{i3}$ are the OLS residuals from
the first-stage regressions. Notice that we only have to add two control functions to
account for the endogeneity of $y_2$, $y_3$, $y_2 y_3$, $y_2 \mathbf{z}_1$, and $y_3 \mathbf{z}_1$. Of course, this control
function approach uses assumption (9.72); otherwise it is generally inconsistent. The
control function approach can be made more flexible by adding, say, the interaction
$\hat{u}_{i2}\hat{u}_{i3}$ to (9.73), and maybe even $\hat{u}_{i2}^2$ and $\hat{u}_{i3}^2$ (especially if $y_2^2$ and $y_3^2$ were in the
model). The standard errors of the structural estimates should generally account for
the first-stage estimation, as discussed in Chapter 6. A simple test of the null hypothesis
that $y_2$ and $y_3$ are exogenous is that all terms involving $\hat{u}_{i2}$ and $\hat{u}_{i3}$ are
jointly insignificant; under the null hypothesis, we can ignore the generated regressor
problem.

As discussed in Section 6.2, the control function approach for nonlinear models is
less robust than applying IV methods directly to equation (9.64) (with equation (9.70)
as a special case), but the CF estimator can be more precise, especially if we can get
by with few functions of the first-stage residuals. Plus, as shown by Newey, Powell,
and Vella (1999), one can extend the CF approach without imposing functional form
assumptions at all. Such nonparametric methods are beyond the scope of this text,
but they are easily motivated by equation (9.68). In practice, choosing interactions,
quadratics, and maybe higher-order polynomials (along with judicious taking of
logarithms) might be sufficient.
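The two-step control function procedure for an equation like (9.70) can be sketched in a few lines. Everything in the simulation below is an assumption made for illustration: the exogenous variables in the structural equation are a constant and a single $z_1$, the reduced forms are linear in $(1, z_1, z_2, z_3)$, the errors are normal and independent of $\mathbf{z}$, and the parameter values are invented. The structural error is built to satisfy a linear conditional mean of the form (9.72).

```python
import numpy as np

def ols(X, y):
    """OLS coefficients from regressing y on X."""
    return np.linalg.lstsq(X, y, rcond=None)[0]

# Assumed structural parameters for a version of (9.70) with z_1 = (1, z1).
d10, d11 = 1.0, 0.5                  # intercept and coefficient on z1
a11, a12, a13 = 1.0, -0.5, 0.3       # coefficients on y2, y3, y2*y3
g11, g12 = 0.2, -0.2                 # coefficients on y2*z1, y3*z1
lam12, lam13 = 0.7, -0.4             # control function coefficients as in (9.72)

rng = np.random.default_rng(42)
n = 100_000
z1, z2, z3, u2, u3, e = rng.normal(size=(6, n))
u1 = lam12 * u2 + lam13 * u3 + e     # E(u1 | z, u2, u3) is linear in (u2, u3)
y2 = z1 + z2 + u2                    # reduced forms, as in (9.71)
y3 = 0.5 * z1 + z3 + u3
y1 = (d10 + d11 * z1 + a11 * y2 + a12 * y3 + a13 * y2 * y3
      + g11 * y2 * z1 + g12 * y3 * z1 + u1)

one = np.ones(n)
Z = np.column_stack([one, z1, z2, z3])

# Step 1: first-stage OLS residuals for each endogenous variable.
u2_hat = y2 - Z @ ols(Z, y2)
u3_hat = y3 - Z @ ols(Z, y3)

# Step 2: OLS regression (9.73), with the residuals added as control functions.
X = np.column_stack([one, z1, y2, y3, y2 * y3, y2 * z1, y3 * z1])
b_ols = ols(X, y1)                                   # ignores endogeneity
b_cf = ols(np.column_stack([X, u2_hat, u3_hat]), y1)

truth = np.array([d10, d11, a11, a12, a13, g11, g12, lam12, lam13])
print("truth:", truth)
print("CF   :", np.round(b_cf, 3))
print("OLS  :", np.round(b_ols, 3))
```

In this design the control function estimates should be close to the assumed values, while OLS without the residuals is off for the coefficients on $y_2$ and $y_3$. The sketch reports point estimates only; it does not adjust standard errors for the first-stage estimation.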


9.6 Different Instruments for Different Equations 


There are general classes of SEMs where the same instruments cannot be used for 
every equation. We already encountered one such example, the fully recursive sys- 
tem. Another general class of models is SEMs where, in addition to simultaneous 
determination of some variables, some equations contain variables that are endoge- 
nous as a result of omitted variables or measurement error. 

As an example, reconsider the labor supply and wage offer equations (9.28) and 
(9.62), respectively. On the one hand, in the supply function it is not unreasonable to 
assume that variables other than log(wage) are uncorrelated with u1. On the other 
hand, ability is a variable omitted from the log(wage) equation, and so educ might 
be correlated with u2. This is an omitted variable, not a simultaneity, issue, but the 
statistical problem is the same: correlation between the error and an explanatory 
variable. 

Equation (9.28) is still identified as it was before, because educ is exogenous in
equation (9.28). What about equation (9.62)? It satisfies the order condition because
we have excluded four exogenous variables from equation (9.62): age, kidslt6,
kidsge6, and nwifeinc. How can we analyze the rank condition for this equation? We
need to add to the system the linear projection of educ on all exogenous variables:

$$\begin{aligned} \mathit{educ} = {} & \delta_{30} + \delta_{31}\,\mathit{exper} + \delta_{32}\,\mathit{exper}^2 + \delta_{33}\,\mathit{age} \\ & + \delta_{34}\,\mathit{kidslt6} + \delta_{35}\,\mathit{kidsge6} + \delta_{36}\,\mathit{nwifeinc} + u_3. \end{aligned} \tag{9.74}$$

Provided the variables other than $\mathit{exper}$ and $\mathit{exper}^2$ are sufficiently partially correlated
with educ, the log(wage) equation is identified. However, the 2SLS estimators
might be poorly behaved if the instruments are not very good. If possible, we would
add other exogenous factors to equation (9.74) that are partially correlated with educ,
such as mother's and father's education. In a system procedure, because we have
assumed that educ is uncorrelated with $u_1$, educ can, and should, be included in the
list of instruments for estimating equation (9.28).

This example shows that having different instruments for different equations 
changes nothing for single-equation analysis: we simply determine the valid list of 
instruments for the endogenous variables in the equation of interest and then estimate 
the equations separately by 2SLS. Instruments may be required to deal with simul- 
taneity, omitted variables, or measurement error, in any combination. 
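The point that single-equation analysis only requires the right instrument list for each equation can be illustrated by simulation. The sketch below assumes a two-equation simultaneous system in which one exogenous variable, $z_2$, is correlated with the first equation's error but not with the second's. All parameter values and distributions are invented for the illustration, and `tsls` is our own helper.

```python
import numpy as np

def tsls(X, Z, y):
    """2SLS: project the regressors X on the instruments Z, then solve X_hat'X b = X_hat'y."""
    X_hat = Z @ np.linalg.lstsq(Z, X, rcond=None)[0]
    return np.linalg.solve(X_hat.T @ X, X_hat.T @ y)

# Assumed system: y1 = d10 + g12*y2 + d11*z1 + u1
#                 y2 = d20 + g21*y1 + d22*z2 + d23*z3 + u2
d10, g12, d11 = 1.0, 0.5, 1.0
d20, g21, d22, d23 = 0.0, -0.5, 1.0, 1.0

rng = np.random.default_rng(7)
n = 100_000
a, b, c, xi, z1, z3 = rng.normal(size=(6, n))
s = np.sqrt(0.5)
u1 = s * (a + c)           # error in the first equation
u2 = s * (b + c)           # error in the second equation, correlated with u1 through c
z2 = xi + s * a            # correlated with u1 but uncorrelated with u2

# Solve the two equations for the endogenous variables.
y1 = (d10 + g12 * d20 + d11 * z1 + g12 * d22 * z2 + g12 * d23 * z3
      + g12 * u2 + u1) / (1.0 - g12 * g21)
y2 = d20 + g21 * y1 + d22 * z2 + d23 * z3 + u2

one = np.ones(n)
X1 = np.column_stack([one, y2, z1])          # regressors in the first equation
X2 = np.column_stack([one, y1, z2, z3])      # regressors in the second equation
Z_eq1 = np.column_stack([one, z1, z3])       # valid instruments for the first equation
Z_all = np.column_stack([one, z1, z2, z3])   # valid for the second equation only

b1_valid = tsls(X1, Z_eq1, y1)
b1_invalid = tsls(X1, Z_all, y1)             # wrongly treats z2 as exogenous
b2 = tsls(X2, Z_all, y2)

print("gamma_12: true", g12, "valid IVs", round(b1_valid[1], 3),
      "with z2 added", round(b1_invalid[1], 3))
print("gamma_21: true", g21, "2SLS", round(b2[1], 3))
```

Estimating each equation separately with its own instrument list recovers the assumed coefficients in this design, whereas adding $z_2$ to the instruments for the first equation does not.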

Estimation is more complicated for system methods. First, if 3SLS is to be used, 
then the GMM 3SLS version must be used to produce consistent estimators of any 
equation; the more traditional 3SLS estimator discussed in Section 8.3.5 is generally 
valid only when all instruments are uncorrelated with all errors. When we have dif- 
ferent instruments for different equations, the instrument matrix has the form in 
equation (8.15). 

There is a more subtle issue that arises in system analysis with different instruments 
for different equations. While it is still popular to use 3SLS methods for such prob- 
lems, it turns out that the key assumption that makes 3SLS the efficient GMM esti- 
mator, Assumption SIV.5, is often violated. In such cases the GMM estimator with 
general weighting matrix enhances asymptotic efficiency and simplifies inference. 

As a simple example, consider a two-equation system 


$$y_1 = \delta_{10} + \gamma_{12} y_2 + \delta_{11} z_1 + u_1, \tag{9.75}$$
$$y_2 = \delta_{20} + \gamma_{21} y_1 + \delta_{22} z_2 + \delta_{23} z_3 + u_2, \tag{9.76}$$


where $(u_1, u_2)$ has mean zero and variance matrix $\boldsymbol{\Sigma}$. Suppose that $z_1$, $z_2$, and $z_3$ are
uncorrelated with $u_2$ but we can only assume that $z_1$ and $z_3$ are uncorrelated with $u_1$.
In other words, $z_2$ is not exogenous in equation (9.75). Each equation is still identified
by the order condition, and we just assume that the rank conditions also hold. The
instruments for equation (9.75) are $(1, z_1, z_3)$, and the instruments for equation (9.76)
are $(1, z_1, z_2, z_3)$. Write these as $\mathbf{z}_1 \equiv (1, z_1, z_3)$ and $\mathbf{z}_2 \equiv (1, z_1, z_2, z_3)$. Assumption
SIV.5 requires the following three conditions:

$$E(u_1^2 \mathbf{z}_1' \mathbf{z}_1) = \sigma_1^2 E(\mathbf{z}_1' \mathbf{z}_1). \tag{9.77}$$
$$E(u_2^2 \mathbf{z}_2' \mathbf{z}_2) = \sigma_2^2 E(\mathbf{z}_2' \mathbf{z}_2). \tag{9.78}$$
$$E(u_1 u_2 \mathbf{z}_1' \mathbf{z}_2) = \sigma_{12} E(\mathbf{z}_1' \mathbf{z}_2). \tag{9.79}$$

The first two conditions hold if $E(u_1 \mid \mathbf{z}_1) = E(u_2 \mid \mathbf{z}_2) = 0$ and $\mathrm{Var}(u_1 \mid \mathbf{z}_1) = \sigma_1^2$,
$\mathrm{Var}(u_2 \mid \mathbf{z}_2) = \sigma_2^2$. These are standard zero conditional mean and homoskedasticity
assumptions. The potential problem comes with condition (9.79). Since $u_1$ is correlated
with one of the elements in $\mathbf{z}_2$, we can hardly just assume condition (9.79).
Generally, there is no conditioning argument that implies condition (9.79). One
case where condition (9.79) holds is if $E(u_2 \mid u_1, z_1, z_2, z_3) = 0$, which implies that $u_2$
and $u_1$ are uncorrelated. The left-hand side of condition (9.79) is also easily shown
to equal zero. But 3SLS with $\sigma_{12} = 0$ imposed is just 2SLS equation by equation.
If $u_1$ and $u_2$ are correlated, we should not expect condition (9.79) to hold, and therefore
the general minimum chi-square estimator should be used for estimation and
inference.

Wooldridge (1996) provides a general discussion and contains other examples of 
cases in which Assumption SIV.5 can and cannot be expected to hold. Whenever a 
system contains linear projections for nonlinear functions of endogenous variables, 
we should expect Assumption SIV.5 to fail. 


Problems 


9.1. Discuss whether each example satisfies the autonomy requirement for true
simultaneous equations analysis. The specification of $y_1$ and $y_2$ means that each is to
be written as a function of the other in a two-equation system.

a. For an employee, $y_1$ = hourly wage, $y_2$ = hourly fringe benefits.

b. At the city level, $y_1$ = per capita crime rate, $y_2$ = per capita law enforcement
expenditures.

c. For a firm operating in a developing country, $y_1$ = firm research and development
expenditures, $y_2$ = firm foreign technology purchases.

d. For an individual, $y_1$ = hourly wage, $y_2$ = alcohol consumption.

e. For a family, $y_1$ = annual housing expenditures, $y_2$ = annual savings.

f. For a profit maximizing firm, $y_1$ = price markup, $y_2$ = advertising expenditures.

g. For a single-output firm, $y_1$ = quantity demanded of its good, $y_2$ = advertising
expenditure.

h. At the city level, $y_1$ = incidence of HIV, $y_2$ = per capita condom sales.


9.2. Write a two-equation system in the form 


$$y_1 = \gamma_{12} y_2 + \mathbf{z}_{(1)} \boldsymbol{\delta}_{(1)} + u_1,$$
$$y_2 = \gamma_{21} y_1 + \mathbf{z}_{(2)} \boldsymbol{\delta}_{(2)} + u_2.$$

a. Show that reduced forms exist if and only if $\gamma_{12}\gamma_{21} \neq 1$.


b. State in words the rank condition for identifying each equation. 


9.3. The following model jointly determines monthly child support payments and 
monthly visitation rights for divorced couples with children: 


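$$\mathit{support} = \delta_{10} + \gamma_{12}\,\mathit{visits} + \delta_{11}\,\mathit{finc} + \delta_{12}\,\mathit{fremarr} + \delta_{13}\,\mathit{dist} + u_1,$$

$$\mathit{visits} = \delta_{20} + \gamma_{21}\,\mathit{support} + \delta_{21}\,\mathit{mremarr} + \delta_{22}\,\mathit{dist} + u_2.$$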


For expository purposes, assume that children live with their mothers, so that fathers 
pay child support. Thus, the first equation is the father’s “reaction function”: it 
describes the amount of child support paid for any given level of visitation rights and 
the other exogenous variables finc (father’s income), fremarr (binary indicator if 
father remarried), and dist (miles currently between the mother and father). Similarly, 
the second equation is the mother’s reaction function: it describes visitation rights for 
a given amount of child support; mremarr is a binary indicator for whether the 
mother is remarried. 


a. Discuss identification of each equation. 

b. How would you estimate each equation using a single-equation method? 

c. How would you test for endogeneity of visits in the father’s reaction function? 

d. How many overidentification restrictions are there in the mother’s reaction func- 
tion? Explain how to test the overidentifying restriction(s). 


9.4. Consider the following three-equation structural model: 


$$y_1 = \gamma_{12} y_2 + \delta_{11} z_1 + \delta_{12} z_2 + \delta_{13} z_3 + u_1,$$

$$y_1 = \gamma_{22} y_2 + \gamma_{23} y_3 + \delta_{21} z_1 + u_2,$$

$$y_3 = \delta_{31} z_1 + \delta_{32} z_2 + \delta_{33} z_3 + u_3,$$

where $z_1 \equiv 1$ (to allow an intercept), $E(u_g) = 0$, all $g$, and each $z_j$ is uncorrelated with
each $u_g$. You might think of the first two equations as demand and supply equations,
where the supply equation depends on a possibly endogenous variable $y_3$ (such as
wage costs) that might be correlated with $u_2$. For example, $u_2$ might contain managerial
quality.

a. Show that a well-defined reduced form exists as long as $\gamma_{12} \neq \gamma_{22}$.


b. Allowing for the structural errors to be arbitrarily correlated, determine which of 
these equations is identified. First consider the order condition, and then the rank 
condition. 
9.5. The following three-equation structural model describes a population:

$$y_1 = \gamma_{12} y_2 + \gamma_{13} y_3 + \delta_{11} z_1 + \delta_{13} z_3 + \delta_{14} z_4 + u_1,$$

$$y_2 = \gamma_{21} y_1 + \delta_{21} z_1 + u_2,$$

$$y_3 = \delta_{31} z_1 + \delta_{32} z_2 + \delta_{33} z_3 + \delta_{34} z_4 + u_3,$$

where you may set $z_1 = 1$ to allow an intercept. Make the usual assumptions that
$E(u_g) = 0$, $g = 1, 2, 3$, and that each $z_j$ is uncorrelated with each $u_g$. In addition to the
exclusion restrictions that have already been imposed, assume that $\delta_{13} + \delta_{14} = 1$.


a. Check the order and rank conditions for the first equation. Determine necessary 
and sufficient conditions for the rank condition to hold. 

b. Assuming that the first equation is identified, propose a single-equation estimation 
method with all restrictions imposed. Be very precise. 


9.6. The following two-equation model contains an interaction between an endog- 
enous and exogenous variable (see Example 6.2 for such a model in an omitted 
variable context): 


$$y_1 = \delta_{10} + \gamma_{12} y_2 + \gamma_{13} y_2 z_1 + \delta_{11} z_1 + \delta_{12} z_2 + u_1,$$

$$y_2 = \delta_{20} + \gamma_{21} y_1 + \delta_{21} z_1 + \delta_{23} z_3 + u_2.$$

a. Initially, assume that $\gamma_{13} = 0$, so that the model is a linear SEM. Discuss identification
of each equation in this case.

b. For any value of $\gamma_{13}$, find the reduced form for $y_1$ (assuming it exists) in terms of
the $z_j$, the $u_g$, and the parameters.

c. Assuming that $E(u_1 \mid \mathbf{z}) = E(u_2 \mid \mathbf{z}) = 0$, find $E(y_1 \mid \mathbf{z})$.

d. Argue that, under the conditions in part a, the model is identified regardless of the
value of $\gamma_{13}$.


e. Suggest a 2SLS procedure for estimating the first equation. 

f. Define a matrix of instruments suitable for 3SLS estimation. 

g. Suppose that $\delta_{23} = 0$, but we also know that $\gamma_{13} \neq 0$. Can the parameters in the
first equation be consistently estimated? If so, how? Can $H_0\colon \gamma_{13} = 0$ be tested?


9.7. Assume that wage and alcohol consumption are determined by the system 


$$\mathit{wage} = \gamma_{12}\,\mathit{alcohol} + \gamma_{13}\,\mathit{educ} + \mathbf{z}_{(1)}\boldsymbol{\delta}_{(1)} + u_1,$$

$$\mathit{alcohol} = \gamma_{21}\,\mathit{wage} + \gamma_{23}\,\mathit{educ} + \mathbf{z}_{(2)}\boldsymbol{\delta}_{(2)} + u_2,$$

$$\mathit{educ} = \mathbf{z}_{(3)}\boldsymbol{\delta}_{(3)} + u_3.$$

The third equation is a reduced form for years of education.

Elements in $\mathbf{z}_{(1)}$ include a constant, experience, gender, marital status, and amount
of job training. The vector $\mathbf{z}_{(2)}$ contains a constant, experience, gender, marital status,
and local prices (including taxes) on various alcoholic beverages. The vector $\mathbf{z}_{(3)}$ can
contain elements in $\mathbf{z}_{(1)}$ and $\mathbf{z}_{(2)}$ and, in addition, exogenous factors affecting education;
for concreteness, suppose one element of $\mathbf{z}_{(3)}$ is distance to nearest college at age
16. Let $\mathbf{z}$ denote the vector containing all nonredundant elements of $\mathbf{z}_{(1)}$, $\mathbf{z}_{(2)}$, and $\mathbf{z}_{(3)}$.
In addition to assuming that $\mathbf{z}$ is uncorrelated with each of $u_1$, $u_2$, and $u_3$, assume that
educ is uncorrelated with $u_2$, but educ might be correlated with $u_1$.


a. When does the order condition hold for the first equation? 


b. State carefully how you would estimate the first equation using a single-equation 
method. 


c. For each observation i define the matrix of instruments for system estimation of 
all three equations. 


d. In a system procedure, how should you choose $\mathbf{z}_{(3)}$ to make the analysis as robust
as possible to factors appearing in the reduced form for educ?


9.8. a. Extend Problem 5.4b using CARD.RAW to allow $\mathit{educ}^2$ to appear in the
log(wage) equation, without using nearc2 as an instrument. Specifically, use interactions
of nearc4 with some or all of the other exogenous variables in the log(wage)
equation as instruments for $\mathit{educ}^2$. Compute a heteroskedasticity-robust test to be
sure that at least one of these additional instruments appears in the linear projection
of $\mathit{educ}^2$ onto your entire list of instruments. Test whether $\mathit{educ}^2$ needs to be in the
log(wage) equation.

b. Start again with the model estimated in Problem 5.4b, but suppose we add the
interaction $\mathit{black} \cdot \mathit{educ}$. Explain why $\mathit{black} \cdot z_j$ is a potential IV for $\mathit{black} \cdot \mathit{educ}$, where $z_j$
is any exogenous variable in the system (including nearc4).

c. In Example 6.2 we used $\mathit{black} \cdot \mathit{nearc4}$ as the IV for $\mathit{black} \cdot \mathit{educ}$. Now use 2SLS with
$\mathit{black} \cdot \widehat{\mathit{educ}}$ as the IV for $\mathit{black} \cdot \mathit{educ}$, where $\widehat{\mathit{educ}}$ are the fitted values from the first-stage
regression of educ on all exogenous variables (including nearc4). What do you
find?

d. If $E(\mathit{educ} \mid \mathbf{z})$ is linear and $\mathrm{Var}(u_1 \mid \mathbf{z}) = \sigma_1^2$, where $\mathbf{z}$ is the set of all exogenous
variables and $u_1$ is the error in the log(wage) equation, explain why the estimator
using $\mathit{black} \cdot \widehat{\mathit{educ}}$ as the IV is asymptotically more efficient than the estimator using
$\mathit{black} \cdot \mathit{nearc4}$ as the IV.
9.9. Use the data in MROZ.RAW for this question.


a. Estimate equations (9.28) and (9.29) jointly by 3SLS, and compare the 3SLS esti- 
mates with the 2SLS estimates for equations (9.28) and (9.29). 

b. Now allow educ to be endogenous in equation (9.29), but assume it is exogenous 
in equation (9.28). Estimate a three-equation system using different instruments for 
different equations, where motheduc, fatheduc, and huseduc are assumed exogenous in 
equations (9.28) and (9.29). 


9.10. Consider a two-equation system of the form 


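$$y_1 = \gamma_{12} y_2 + \mathbf{z}_1\boldsymbol{\delta}_1 + u_1,$$

$$y_2 = \mathbf{z}_2\boldsymbol{\delta}_2 + u_2.$$

Assume that $\mathbf{z}_1$ contains at least one element not also in $\mathbf{z}_2$, and $\mathbf{z}_2$ contains at least
one element not in $\mathbf{z}_1$. The second equation is also the reduced form for $y_2$, but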
restrictions have been imposed to make it a structural equation. (For example, it 
could be a wage offer equation with exclusion restrictions imposed, whereas the first 
equation is a labor supply function.) 


a. If we estimate the first equation by 2SLS using all exogenous variables as IVs, are 
we imposing the exclusion restrictions in the second equation? (Hint: Does the first- 
stage regression in 2SLS impose any restrictions on the reduced form?) 


b. Will the 3SLS estimates of the first equation be the same as the 2SLS estimates? 
Explain. 


c. Explain why 2SLS is more robust than 3SLS for estimating the parameters of the 
first equation. 


9.11. Consider a two-equation SEM: 


$$y_1 = \gamma_{12} y_2 + \delta_{11} z_1 + u_1,$$

$$y_2 = \gamma_{21} y_1 + \delta_{22} z_2 + \delta_{23} z_3 + u_2,$$

$$E(u_1 \mid z_1, z_2, z_3) = E(u_2 \mid z_1, z_2, z_3) = 0,$$

where, for simplicity, we omit intercepts. The exogenous variable $z_1$ is a policy variable,
such as a tax rate. Assume that $\gamma_{12}\gamma_{21} \neq 1$. The structural errors, $u_1$ and $u_2$, may
be correlated.

a. Under what assumptions is each equation identified?

b. The reduced form for $y_1$ can be written in conditional expectation form as
$E(y_1 \mid \mathbf{z}) = \pi_{11} z_1 + \pi_{12} z_2 + \pi_{13} z_3$, where $\mathbf{z} = (z_1, z_2, z_3)$. Find $\pi_{11}$ in terms of the
$\gamma_{gj}$ and $\delta_{gj}$.

c. How would you estimate the structural parameters? How would you obtain $\hat{\pi}_{11}$ in
terms of the structural parameter estimates?

d. Suppose that $z_2$ should be in the first equation, but it is left out in the estimation
from part c. What effect does this omission have on estimating $\partial E(y_1 \mid \mathbf{z})/\partial z_1$? Does
it matter whether you use single-equation or system estimators of the structural
parameters?

e. If you are only interested in $\partial E(y_1 \mid \mathbf{z})/\partial z_1$, what could you do instead of estimating
an SEM?

f. Would you say estimating a simultaneous equations model is a robust method for
estimating $\partial E(y_1 \mid \mathbf{z})/\partial z_1$? Explain.


9.12. The following is a two-equation, nonlinear SEM: 


$$y_1 = \delta_{10} + \gamma_{12} y_2 + \gamma_{13} y_2^2 + \mathbf{z}_1\boldsymbol{\delta}_1 + u_1,$$

$$y_2 = \delta_{20} + \gamma_{12} y_1 + \mathbf{z}_2\boldsymbol{\delta}_2 + u_2,$$

where $u_1$ and $u_2$ have zero means conditional on all exogenous variables, $\mathbf{z}$. (For
emphasis, we have included separate intercepts.) Assume that both equations are
identified when $\gamma_{13} = 0$.

a. When $\gamma_{13} = 0$, $E(y_2 \mid \mathbf{z}) = \pi_{20} + \mathbf{z}\boldsymbol{\pi}_2$. What is $E(y_2^2 \mid \mathbf{z})$ under homoskedasticity
assumptions for $u_1$ and $u_2$?

b. Use part a to find $E(y_1 \mid \mathbf{z})$ when $\gamma_{13} = 0$.

c. Use part b to argue that, when $\gamma_{13} = 0$, the forbidden regression consistently estimates
the parameters in the first equation, including $\gamma_{13} = 0$.

d. If $u_1$ and $u_2$ have constant variances conditional on $\mathbf{z}$, and $\gamma_{13}$ happens to be zero,
show that the optimal instrumental variables for estimating the first equation are
$\{1, \mathbf{z}, [E(y_2 \mid \mathbf{z})]^2\}$. (Hint: Use Theorem 8.5; for a similar problem, see Problem 8.11.)

e. Reestimate equation (9.61) using IVs $[1, \mathbf{z}, (\hat{y}_2)^2]$, where $\mathbf{z}$ is all exogenous variables
appearing in equations (9.61) and (9.62) and $\hat{y}_2$ denotes the fitted values from
regressing log(wage) on 1, $\mathbf{z}$. Discuss the results.


9.13. For this question use the data in OPENNESS.RAW, taken from Romer (1993). 


a. A simple simultaneous equations model to test whether “openness” (open) leads to 
lower inflation rates (inf) is 


$$\mathit{inf} = \delta_{10} + \gamma_{12}\,\mathit{open} + \delta_{11}\log(\mathit{pcinc}) + u_1,$$

$$\mathit{open} = \delta_{20} + \gamma_{21}\,\mathit{inf} + \delta_{21}\log(\mathit{pcinc}) + \delta_{22}\log(\mathit{land}) + u_2.$$

Assuming that pcinc (per capita income) and land (land area) are exogenous, under
what assumption is the first equation identified?

b. Estimate the reduced form for open to verify that log(land) is statistically significant.

c. Estimate the first equation from part a by 2SLS. Compare the estimate of $\gamma_{12}$ with
the OLS estimate.

d. Add the term $\gamma_{13}\,\mathit{open}^2$ to the first equation, and propose a way to test whether it is
statistically significant. (Use only one more IV than you used in part c.)

e. With $\gamma_{13}\,\mathit{open}^2$ in the first equation, use the following method to estimate $\delta_{10}$, $\gamma_{12}$,
$\gamma_{13}$, and $\delta_{11}$: (1) Regress open on 1, log(pcinc), and log(land), and obtain the fitted
values, $\widehat{\mathit{open}}$. (2) Regress inf on 1, $\widehat{\mathit{open}}$, $(\widehat{\mathit{open}})^2$, and log(pcinc). Compare the results
with those from part d. Which estimates do you prefer?


9.14. Answer “agree” or “disagree” to each of the following statements, and pro- 
vide justification. 


a. “In the general SEM (9.13), if the matrix of structural parameters $\boldsymbol{\Gamma}$ is identified,
then so is the variance matrix $\boldsymbol{\Sigma}$ of the structural errors.”


b. “Identification in a nonlinear structural equation when the errors are independent 
of the exogenous variables can always be assured by adding enough nonlinear func- 
tions of exogenous variables to the instrument list.” 


c. “If in a system of equations $E(\mathbf{u} \mid \mathbf{z}) = \mathbf{0}$ but $\mathrm{Var}(\mathbf{u} \mid \mathbf{z}) \neq \mathrm{Var}(\mathbf{u})$, the 3SLS
estimator is inconsistent.”


d. “In a well-specified simultaneous equations model, any variable exogenous in one 
equation should be exogenous in all equations.” 


e. “In a triangular system of three equations, control function methods are preferred 
to standard IV approaches.” 


9.15. Consider the triangular system in equations (9.70) and (9.71) under the as- 
sumption E(u; |z) = 0, where z (1 x L) contains all exogenous variables in the three 
equations and z; is | x L4, with Lı < L—2. 


a. Suppose that the model is identified when «13 = 0, yı; = 0, and y»,, = 0. Argue 
that the general model is identified. 


b. Let fp and f; be the first-stage fitted values from the regressions yj on z; and 
yə ON Z;, respectively. Suppose you estimate (9.70) by IV using instruments 
(Zit, Vins Vian Yna, Viki, Jazzi). Are there any overidentification restrictions to test? 
Explain. 
c. Suppose $L > L_1 + 2$ and you estimate (9.70) by 2SLS using instruments
(Zi, ofig, Vio%i, faza). How many overidentifying restrictions are there, and how 
would you test them? 

d. Propose a method that gives even more overidentifying restrictions. 

e. If you add the assumptions E(uj | z;) = E(u; |z;) = 0 and E(u? |z;) = 07, argue 
that the estimator from part b is an asymptotically efficient IV estimator. What can 
you conclude about the estimators from parts c and d? 


10 Basic Linear Unobserved Effects Panel Data Models


In Chapter 7 we covered a class of linear panel data models where, at a minimum, the 
error in each time period was assumed to be uncorrelated with the explanatory vari- 
ables in the same time period. For certain panel data applications this assumption is 
too strong. In fact, a primary motivation for using panel data is to solve the omitted 
variables problem. 

In this chapter we study population models that explicitly contain a time-constant, 
unobserved effect. The treatment in this chapter is “modern” in the sense that unob- 
served effects are treated as random variables, drawn from the population along with 
the observed explained and explanatory variables, as opposed to parameters to be 
estimated. In this framework, the key issue is whether the unobserved effect is un- 
correlated with the explanatory variables. 


10.1 Motivation: Omitted Variables Problem 


It is easy to see how panel data can be used, at least under certain assumptions, to 
obtain consistent estimators in the presence of omitted variables. Let $y$ and
$\mathbf{x} = (x_1, x_2, \ldots, x_K)$ be observable random variables, and let $c$ be an unobservable random
variable; the vector $(y, x_1, x_2, \ldots, x_K, c)$ represents the population of interest. As
is often the case in applied econometrics, we are interested in the partial effects of the
observable explanatory variables $x_j$ in the population regression function

$$E(y \mid x_1, x_2, \ldots, x_K, c). \tag{10.1}$$

In words, we would like to hold $c$ constant when obtaining partial effects of the observable
explanatory variables. We follow Chamberlain (1984) in using $c$ to denote
the unobserved variable. Much of the panel data literature uses a Greek letter, such
as $\alpha$ or $\phi$, but we want to emphasize that the unobservable is a random variable, not a
parameter to be estimated. (We discuss this point further in Section 10.2.1.)

Assuming a linear model, with $c$ entering additively along with the $x_j$, we have

$$E(y \mid \mathbf{x}, c) = \beta_0 + \mathbf{x}\boldsymbol{\beta} + c, \tag{10.2}$$

where interest lies in the $K \times 1$ vector $\boldsymbol{\beta}$. On the one hand, if $c$ is uncorrelated with
each $x_j$, then $c$ is just another unobserved factor affecting $y$ that is not systematically
related to the observable explanatory variables whose effects are of interest. On the
other hand, if $\mathrm{Cov}(x_j, c) \neq 0$ for some $j$, putting $c$ into the error term can cause
serious problems. Without additional information we cannot consistently estimate $\boldsymbol{\beta}$,
nor will we be able to determine whether there is a problem (except by introspection,
or by concluding that the estimates of $\boldsymbol{\beta}$ are somehow “unreasonable”).
Under additional assumptions there are ways to address the problem $\mathrm{Cov}(\mathbf{x}, c) \neq \mathbf{0}$.
We have covered at least three possibilities in the context of cross section analysis:
(1) we might be able to find a suitable proxy variable for $c$, in which case we can
estimate an equation by OLS where the proxy is plugged in for $c$; (2) for the elements
of $\mathbf{x}$ that are correlated with $c$, we may be able to find instrumental variables, in
which case we can use an IV method such as 2SLS; or (3) we may be able to find
indicators of $c$ that can then be used in a multiple indicator instrumental variables
procedure. These solutions are covered in Chapters 4 and 5.

If we have access to only a single cross section of observations, then the three
remedies listed, or slight variants of them, largely exhaust the possibilities. However,
if we can observe the same cross section units at different points in time (that is, if
we can collect a panel data set), then other possibilities arise.

For illustration, suppose we can observe $y$ and $\mathbf{x}$ at two different time periods; call
these $y_t$, $\mathbf{x}_t$ for $t = 1, 2$. The population now represents two time periods on the same
unit. Also, suppose that the omitted variable $c$ is time constant. Then we are interested
in the population regression function

$$E(y_t \mid \mathbf{x}_t, c) = \beta_0 + \mathbf{x}_t\boldsymbol{\beta} + c, \qquad t = 1, 2, \tag{10.3}$$

where $\mathbf{x}_t\boldsymbol{\beta} = \beta_1 x_{t1} + \cdots + \beta_K x_{tK}$ and $x_{tj}$ indicates variable $j$ at time $t$. Model (10.3)
assumes that $c$ has the same effect on the mean response in each time period. Without
loss of generality, we set the coefficient on $c$ equal to one. (Because $c$ is unobserved
and virtually never has a natural unit of measurement, it would be meaningless to try
to estimate its partial effect.)

The assumption that c is constant over time, and has a constant partial effect over 
time, is crucial to the following analysis. An unobserved, time-constant variable is 
called an unobserved effect in panel data analysis. When $t$ represents different time
periods for the same individual, the unobserved effect is often interpreted as captur- 
ing features of an individual, such as cognitive ability, motivation, or early family 
upbringing, that are given and do not change over time. Similarly, if the unit of ob- 
servation is the firm, c contains unobserved firm characteristics—such as managerial 
quality or structure—that can be viewed as being (roughly) constant over the period 
in question. We cover several specific examples of unobserved effects models in Sec- 
tion 10.2. 

To discuss the additional assumptions sufficient to estimate $\boldsymbol{\beta}$, it is useful to write
model (10.3) in error form as

$$y_t = \beta_0 + \mathbf{x}_t\boldsymbol{\beta} + c + u_t, \tag{10.4}$$

where, by definition,

$$E(u_t \mid \mathbf{x}_t, c) = 0, \qquad t = 1, 2. \tag{10.5}$$

One implication of condition (10.5) is

$$E(\mathbf{x}_t' u_t) = \mathbf{0}, \qquad t = 1, 2. \tag{10.6}$$

If we were to assume $E(\mathbf{x}_t' c) = \mathbf{0}$, we could apply pooled OLS, as we covered in
Section 7.8. If $c$ is correlated with any element of $\mathbf{x}_t$, then pooled OLS is biased and
inconsistent.

With two years of data we can difference equation (10.4) across the two time periods 
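to eliminate the time-constant unobservable, $c$. Define $\Delta y = y_2 - y_1$, $\Delta\mathbf{x} = \mathbf{x}_2 - \mathbf{x}_1$,
and $\Delta u = u_2 - u_1$. Then, differencing equation (10.4) gives

$$\Delta y = \Delta\mathbf{x}\boldsymbol{\beta} + \Delta u, \tag{10.7}$$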


which is just a standard linear model in the differences of all variables (although the 
intercept has dropped out). Importantly, the parameter vector of interest, B, appears 
directly in equation (10.7), and its presence suggests estimating equation (10.7) by 
OLS. Given a panel data set with two time periods, equation (10.7) is just a standard 
cross section equation. Under what assumptions will the OLS estimator from equa- 
tion (10.7) be consistent? 

Because we assume a random sample from the population, we can apply the results 
in Chapter 4 directly to equation (10.7). The key conditions for OLS to consistently 
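estimate $\boldsymbol{\beta}$ are the orthogonality condition (Assumption OLS.1)

$$E(\Delta\mathbf{x}' \Delta u) = \mathbf{0} \tag{10.8}$$

and the rank condition (Assumption OLS.2)

$$\operatorname{rank} E(\Delta\mathbf{x}' \Delta\mathbf{x}) = K. \tag{10.9}$$

Consider condition (10.8) first. It is equivalent to $E[(\mathbf{x}_2 - \mathbf{x}_1)'(u_2 - u_1)] = \mathbf{0}$, or, after
simple algebra,

$$E(\mathbf{x}_2' u_2) + E(\mathbf{x}_1' u_1) - E(\mathbf{x}_1' u_2) - E(\mathbf{x}_2' u_1) = \mathbf{0}. \tag{10.10}$$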


The first two terms in equation (10.10) are zero by condition (10.6), which holds for
$t = 1, 2$. But condition (10.5) does not guarantee that $\mathbf{x}_1$ and $u_2$ are uncorrelated or
that $\mathbf{x}_2$ and $u_1$ are uncorrelated. It might be reasonable to assume that condition
(10.8) holds, but we must recognize that it does not follow from condition (10.5).
Assuming that the error $u_t$ is uncorrelated with $\mathbf{x}_1$ and $\mathbf{x}_2$ for $t = 1, 2$ is an example of
a strict exogeneity assumption on the regressors in unobserved components panel data
models. We discuss strict exogeneity assumptions generally in Section 10.2. For now,
we emphasize that assuming $\mathrm{Cov}(\mathbf{x}_t, u_s) = \mathbf{0}$ for all $t$ and $s$ puts no restrictions on the
correlation between $\mathbf{x}_t$ and the unobserved effect, $c$.
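A small simulation can make the contrast between pooled OLS and first differencing concrete. The sketch below is only an illustration under assumed values: the data-generating process (a scalar regressor, a slope of 1, an intercept of 2, and a loading of 0.8 of the regressor on $c$) and the function names are hypothetical choices, not taken from the text. The regressors are strictly exogenous with respect to the idiosyncratic errors but correlated with $c$.

```python
import numpy as np

def simulate_two_period_panel(n, beta=1.0, seed=0):
    """Draw (y_t, x_t), t = 1, 2, with a time-constant c correlated with x_t."""
    rng = np.random.default_rng(seed)
    c = rng.normal(size=n)                           # unobserved effect
    x = 0.8 * c[:, None] + rng.normal(size=(n, 2))   # Cov(x_t, c) != 0 (assumed loading)
    u = rng.normal(size=(n, 2))                      # errors independent of x_1, x_2, c
    y = 2.0 + beta * x + c[:, None] + u              # error form with intercept 2
    return y, x

def ols_slope(y, x):
    """Slope from an OLS regression of y on a constant and a scalar x."""
    X = np.column_stack([np.ones_like(x), x])
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coef[1]

def fd_slope(y, x):
    """OLS on the differenced equation; differencing removes the intercept and c."""
    dy = y[:, 1] - y[:, 0]
    dx = x[:, 1] - x[:, 0]
    return float(dx @ dy / (dx @ dx))

y, x = simulate_two_period_panel(n=200_000)
pooled = ols_slope(y.ravel(), x.ravel())   # stacks both periods, leaves c in the error
fd = fd_slope(y, x)
print(f"pooled OLS: {pooled:.3f}, first difference: {fd:.3f}")
```

Under this design the pooled OLS slope should settle well above the true value of 1, because $c$ is left in the error term, while the first-difference slope should be close to 1.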

The second assumption, condition (10.9), also deserves some attention now because
the elements of $\mathbf{x}_t$ appearing in structural equation (10.3) have been differenced
across time. If $\mathbf{x}_t$ contains a variable that is constant across time for every member of
the population, then $\Delta\mathbf{x}$ contains an entry that is identically zero, and condition
(10.9) fails. This outcome is not surprising: if $c$ is allowed to be arbitrarily correlated
with the elements of $\mathbf{x}_t$, the effect of any variable that is constant across time cannot
be distinguished from the effect of $c$. Therefore, we can consistently estimate $\beta_j$ only if
there is some variation in $x_{tj}$ over time.
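The failure of the rank condition with a time-constant regressor is easy to check numerically. The following is a minimal sketch with made-up data: the two regressors (one time varying, one a binary time-constant indicator) and the sample size are assumptions for illustration only.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 500

# Hypothetical regressors: column 0 varies over time, column 1 is time constant
x1 = np.column_stack([rng.normal(size=n), rng.integers(0, 2, size=n)])
x2 = x1.copy()
x2[:, 0] = rng.normal(size=n)   # only the first variable changes between periods

dx = x2 - x1                    # the second column is identically zero
moment = dx.T @ dx / n          # sample analogue of E(dx' dx)
rank = np.linalg.matrix_rank(moment)
print(rank)                     # rank is below K = 2, so the rank condition fails
```

Because the differenced time-constant column is zero for every unit, the sample moment matrix has a zero row and column, and no coefficient on that variable can be recovered from the differenced equation.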

In the remainder of this chapter, we cover various ways of dealing with the pres- 
ence of unobserved effects under different sets of assumptions. We assume we have 
repeated observations on a cross section of N individuals, families, firms, school dis- 
tricts, cities, or some other economic unit. As in Chapter 7, we assume in this chapter 
that we have the same time periods, denoted $t = 1, 2, \ldots, T$, for each cross section
observation. Such a data set is usually called a balanced panel because the same time 
periods are available for all cross section units. While the mechanics of the unbal- 
anced case are similar to the balanced case, a careful treatment of the unbalanced 
case requires a formal description of why the panel may be unbalanced, and the 
sample selection issues can be somewhat subtle. Therefore, we hold off covering un- 
balanced panels until Chapter 21, where we discuss sample selection and attrition 
issues. 

We still focus on asymptotic properties of estimators, where the time dimension, T, 
is fixed and the cross section dimension, N, grows without bound. With large-N 
asymptotics it is convenient to view the cross section observations as independent, 
identically distributed draws from the population. For any cross section observation 
i—denoting a single individual, firm, city, and so on—we denote the observable 
variables for all $T$ time periods by $\{(y_{it}, \mathbf{x}_{it}) : t = 1, 2, \ldots, T\}$. Because of the fixed $T$
assumption, the asymptotic analysis is valid for arbitrary time dependence and dis- 
tributional heterogeneity across t. 

When applying asymptotic analysis to panel data methods, it is important to re- 
member that asymptotics are useful insofar as they provide a reasonable approxi- 
mation to the finite sample properties of estimators and statistics. For example, a 
priori it is difficult to know whether $N \to \infty$ asymptotics works well with, say,
$N = 50$ states in the United States and $T = 8$ years. But we can be pretty confident
that $N \to \infty$ asymptotics are more appropriate than $T \to \infty$ asymptotics, even
though $N$ is practically fixed while $T$ can grow. With large geographical regions, the
random sampling assumption in the cross section dimension is conceptually flawed. 
Nevertheless, if N is sufficiently large relative to T, and if we can assume rough in- 
dependence in the cross section, then our asymptotic analysis should provide suitable 
approximations. 

If T is of the same order as N—for example, N = 60 countries and T = 55 post- 
World War II years—an asymptotic analysis that makes explicit assumptions about 
the nature of the time series dependence is needed. (In special cases, the conclusions 
about consistent estimation and approximate normality of $t$ statistics will be the
same, but not generally.) This area is still relatively new. If $T$ is much larger than $N$,
say $N = 5$ companies and $T = 40$ years, the framework becomes multiple time series
analysis: $N$ can be held fixed while $T \to \infty$. We do not cover time series analysis.


10.2 Assumptions about the Unobserved Effects and Explanatory Variables 


Before analyzing panel data estimation methods in more detail, it is useful to gener- 
ally discuss the nature of the unobserved effects and certain features of the observed 
explanatory variables. 


10.2.1 Random or Fixed Effects? 


The basic unobserved effects model (UEM) can be written, for a randomly drawn 
cross section observation i, as 


$$y_{it} = \mathbf{x}_{it}\boldsymbol{\beta} + c_i + u_{it}, \qquad t = 1, 2, \ldots, T, \tag{10.11}$$

where $\mathbf{x}_{it}$ is $1 \times K$ and can contain observable variables that change across $t$ but not $i$,
variables that change across $i$ but not $t$, and variables that change across $i$ and $t$. In
addition to unobserved effect, there are many other names given to $c_i$ in applications:
unobserved component, latent variable, and unobserved heterogeneity are common. If $i$
indexes individuals, then $c_i$ is sometimes called an individual effect or individual heterogeneity;
analogous terms apply to families, firms, cities, and other cross-sectional
units. The $u_{it}$ are called the idiosyncratic errors or idiosyncratic disturbances because
these change across $t$ as well as across $i$.

Especially in methodological papers, but also in applications, one often sees a discussion
about whether $c_i$ will be treated as a random effect or a fixed effect. Originally,
such discussions centered on whether $c_i$ is properly viewed as a random variable
or as a parameter to be estimated. In the traditional approach to panel data models,
$c_i$ is called a “random effect” when it is treated as a random variable and a “fixed
effect” when it is treated as a parameter to be estimated for each cross section observation
$i$. Our view is that discussions about whether the $c_i$ should be treated as
random variables or as parameters to be estimated are wrongheaded for microeconometric
panel data applications. With a large number of random draws from the
cross section, it almost always makes sense to treat the unobserved effects, $c_i$, as
random draws from the population, along with $y_{it}$ and $\mathbf{x}_{it}$. This approach is certainly
appropriate from an omitted variables or neglected heterogeneity perspective. As our
discussion in Section 10.1 suggests, the key issue involving $c_i$ is whether or not it is
uncorrelated with the observed explanatory variables $\mathbf{x}_{it}$, $t = 1, 2, \ldots, T$. Mundlak
(1978) made this argument many years ago, and it still is persuasive.

In modern econometric parlance, a random effects framework is synonymous with
zero correlation between the observed explanatory variables and the unobserved effect:
$\mathrm{Cov}(\mathbf{x}_{it}, c_i) = \mathbf{0}$, $t = 1, 2, \ldots, T$. (Actually, a stronger conditional mean independence
assumption, $E(c_i \mid \mathbf{x}_{i1}, \ldots, \mathbf{x}_{iT}) = E(c_i)$, will be needed to fully justify
statistical inference; more on this subject in Section 10.4.) In applied papers, when $c_i$
is referred to as, say, an “individual random effect,” then $c_i$ is probably being
assumed to be uncorrelated with the $\mathbf{x}_{it}$.

In most microeconometric applications, a fixed effects framework does not actually
mean that $c_i$ is being treated as nonrandom; rather, it means that one is allowing
for arbitrary dependence between the unobserved effect $c_i$ and the observed explanatory
variables $\mathbf{x}_{it}$. So, if $c_i$ is called an “individual fixed effect” or a “firm fixed
effect,” then, for practical purposes, this terminology means that $c_i$ is allowed to be
correlated arbitrarily with $\mathbf{x}_{it}$.

More recently, another concept has cropped up for describing situations where we
allow dependence between $c_i$ and $\{\mathbf{x}_{it} : t = 1, \ldots, T\}$, especially, but not only, for
nonlinear models. (See, for example, Cameron and Trivedi (2005, pp. 719, 786).) If
we model the dependence between $c_i$ and $\mathbf{x}_i = (\mathbf{x}_{i1}, \mathbf{x}_{i2}, \ldots, \mathbf{x}_{iT})$ (or, more generally,
place substantive restrictions on the distribution of $c_i$ given $\mathbf{x}_i$), then we are using a
correlated random effects (CRE) framework. (The name refers to allowing correlation
between $c_i$ and $\mathbf{x}_i$.) Practically, the key difference between a fixed effects approach
and a correlated random effects approach is that in the former case the relationship
between $c_i$ and $\mathbf{x}_i$ is left entirely unspecified, while in the latter case we restrict this
dependence in some way, sometimes rather severely in the case of nonlinear models
(as we will see in Part IV). In Section 10.7.2 we cover Mundlak’s (1978) CRE
framework in the context of the standard unobserved effects model. Mundlak’s
approach yields important insights for understanding the difference between the
random and fixed effects frameworks, and it is very useful for testing whether $c_i$ is
uncorrelated with the regressors (the critical assumption in a traditional random
effects analysis). Further, Mundlak’s approach is indispensable for analyzing a broad
class of nonlinear models with unobserved effects, a topic we cover in Part IV.

In this book, we avoid referring to $c_i$ as a random effect or a fixed effect because of
the unwanted connotations of these terms concerning the nature of $c_i$. Instead, we
will mostly refer to $c_i$ as an “unobserved effect” and sometimes as “unobserved heterogeneity.”
We will refer to “random effects assumptions,” “fixed effects assumptions,”
and “correlated random effects assumptions.” Also, we will label different
estimation methods as random effects (RE) estimation, fixed effects (FE) estimation,
and, less often for linear models, correlated random effects (CRE) estimation. The
terminology for estimation methods is so ingrained in econometrics that it would be
counterproductive to try to change it now.


10.2.2 Strict Exogeneity Assumptions on the Explanatory Variables 


Traditional unobserved components panel data models take the $\mathbf{x}_{it}$ as nonrandom.
We will never assume the $\mathbf{x}_{it}$ are nonrandom because potential feedback from $y_{it}$ to
$\mathbf{x}_{is}$ for $s > t$ needs to be addressed explicitly.

In Chapter 7 we discussed strict exogeneity assumptions in panel data models that 
did not explicitly contain unobserved effects. We now provide strict exogeneity 
assumptions for models with unobserved effects. 

In Section 10.1 we stated the strict exogeneity assumption in terms of zero corre- 
lation. For inference and efficiency discussions, it is more convenient to state the 
strict exogeneity assumption in terms of conditional expectations. With an unob- 
served effect, the most revealing form of the strict exogeneity assumption is 


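$$E(y_{it} \mid \mathbf{x}_{i1}, \mathbf{x}_{i2}, \ldots, \mathbf{x}_{iT}, c_i) = E(y_{it} \mid \mathbf{x}_{it}, c_i) = \mathbf{x}_{it}\boldsymbol{\beta} + c_i \tag{10.12}$$

for $t = 1, 2, \ldots, T$. The second equality is the functional form assumption on
$E(y_{it} \mid \mathbf{x}_{it}, c_i)$. It is the first equality that gives the strict exogeneity its interpretation. It
means that, once $\mathbf{x}_{it}$ and $c_i$ are controlled for, $\mathbf{x}_{is}$ has no partial effect on $y_{it}$ for $s \neq t$.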

When assumption (10.12) holds, we say that the $\{\mathbf{x}_{it} : t = 1, 2, \ldots, T\}$ are strictly
exogenous conditional on the unobserved effect $c_i$. Assumption (10.12) and the corre-
sponding terminology were introduced and used by Chamberlain (1982). We will 
explicitly cover Chamberlain’s approach to estimating unobserved effects models in 
the next chapter, but his manner of stating assumptions is instructive even for tradi- 
tional panel data analysis. 

Assumption (10.12) restricts how the expected value of $y_{it}$ can depend on
explanatory variables in other time periods, but it is more reasonable than strict exogeneity
without conditioning on the unobserved effect. Without conditioning on an
unobserved effect, the strict exogeneity assumption is


$$E(y_{it} \mid \mathbf{x}_{i1}, \mathbf{x}_{i2}, \ldots, \mathbf{x}_{iT}) = E(y_{it} \mid \mathbf{x}_{it}) = \mathbf{x}_{it}\boldsymbol{\beta}, \tag{10.13}$$


$t = 1, \ldots, T$. To see that assumption (10.13) is less likely to hold than assumption
(10.12), first consider an example. Suppose that $y_{it}$ is output of soybeans for farm $i$
during year $t$, and $\mathbf{x}_{it}$ contains capital, labor, materials (such as fertilizer), rainfall,
and other observable inputs. The unobserved effect, $c_i$, can capture average quality of
land, managerial ability of the family running the farm, and other unobserved,
time-constant factors. A natural assumption is that, once current inputs have been
controlled for along with $c_i$, inputs used in other years have no effect on output during
the current year. However, since the optimal choice of inputs in every year generally
depends on $c_i$, it is likely that some partial correlation between output in year $t$ and
inputs in other years will exist if $c_i$ is not controlled for: assumption (10.12) is
reasonable while assumption (10.13) is not.

More generally, it is easy to see that assumption (10.13) fails whenever assumption
(10.12) holds and the expected value of $c_i$ depends on $(\mathbf{x}_{i1}, \ldots, \mathbf{x}_{iT})$. From the law of
iterated expectations, if assumption (10.12) holds, then


$$E(y_{it} \mid \mathbf{x}_{i1}, \ldots, \mathbf{x}_{iT}) = \mathbf{x}_{it}\boldsymbol{\beta} + E(c_i \mid \mathbf{x}_{i1}, \ldots, \mathbf{x}_{iT}),$$


and so assumption (10.13) fails if $E(c_i \mid \mathbf{x}_{i1}, \ldots, \mathbf{x}_{iT}) \neq E(c_i)$. In particular,
assumption (10.13) fails if $c_i$ is correlated with any of the $\mathbf{x}_{it}$.
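The iterated-expectations argument can be illustrated with a small simulation. The sketch below is only an illustration under assumed settings that are not taken from the text: a scalar regressor, $T = 2$, normal draws, and $x_{it} = c_i + \text{noise}$, so that (10.12) holds by construction while $c_i$ is correlated with the regressors. The function name `coef_on_other_period` is hypothetical.

```python
import numpy as np


def coef_on_other_period(n=100_000, beta=1.0, seed=0):
    """Regress y_i1 on (1, x_i1, x_i2), with and without controlling for c_i.

    Returns the estimated coefficient on x_i2 from each regression.
    """
    rng = np.random.default_rng(seed)
    c = rng.normal(size=n)                     # unobserved effect c_i
    x = c[:, None] + rng.normal(size=(n, 2))   # x_it = c_i + noise: correlated with c_i
    u = rng.normal(size=(n, 2))                # idiosyncratic errors, independent of (x_i, c_i)
    y = beta * x + c[:, None] + u              # assumption (10.12) holds by construction

    ones = np.ones(n)
    Z_no_c = np.column_stack([ones, x[:, 0], x[:, 1]])
    Z_with_c = np.column_stack([ones, x[:, 0], x[:, 1], c])
    b_no_c = np.linalg.lstsq(Z_no_c, y[:, 0], rcond=None)[0]
    b_with_c = np.linalg.lstsq(Z_with_c, y[:, 0], rcond=None)[0]
    return b_no_c[2], b_with_c[2]


if __name__ == "__main__":
    without_c, with_c = coef_on_other_period()
    print(f"coefficient on x_i2, c_i omitted:  {without_c:.3f}")
    print(f"coefficient on x_i2, c_i included: {with_c:.3f}")
```

In this particular design $E(c_i \mid x_{i1}, x_{i2}) = (x_{i1} + x_{i2})/3$, so the population coefficient on $x_{i2}$ is $1/3$ when $c_i$ is omitted, which violates (10.13), and zero when $c_i$ is controlled for, as (10.12) requires.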

Given equation (10.11), the strict exogeneity assumption can be stated in terms of
the idiosyncratic errors as


$$E(u_{it} \mid \mathbf{x}_{i1}, \ldots, \mathbf{x}_{iT}, c_i) = 0, \qquad t = 1, 2, \ldots, T. \tag{10.14}$$


This assumption, in turn, implies that explanatory variables in each time period are
uncorrelated with the idiosyncratic error in each time period:


$$E(\mathbf{x}_{is}' u_{it}) = \mathbf{0}, \qquad s, t = 1, \ldots, T. \tag{10.15}$$


This assumption is much stronger than assuming zero contemporaneous correlation:
$E(\mathbf{x}_{it}' u_{it}) = \mathbf{0}$, $t = 1, \ldots, T$. Nevertheless, assumption (10.15) does allow arbitrary
correlation between $c_i$ and $\mathbf{x}_{it}$ for all $t$, something we ruled out in Section 7.8. Later, we
will use the fact that assumption (10.14) implies that $u_{it}$ and $c_i$ are uncorrelated.

For examining consistency of panel data estimators, the zero covariance assump- 
tion (10.15) generally suffices. Further, assumption (10.15) is often the easiest way to 
think about whether strict exogeneity is likely to hold in a particular application. But 
standard forms of statistical inference, as well as the efficiency properties of standard 
estimators, rely on the stronger conditional mean formulation in assumption (10.14). 
Therefore, we focus on assumption (10.14).


10.2.3 Some Examples of Unobserved Effects Panel Data Models


Our discussions in Sections 10.2.1 and 10.2.2 emphasize that in any panel data
application we should initially focus on two questions: (1) Is the unobserved effect, $c_i$,
uncorrelated with $\mathbf{x}_{it}$ for all $t$? (2) Is the strict exogeneity assumption (conditional on
$c_i$) reasonable? The following examples illustrate how we might organize our thinking
on these two questions.


Example 10.1 (Program Evaluation): A standard model for estimating the effects of
job training or other programs on subsequent wages is


$$\log(wage_{it}) = \theta_t + \mathbf{z}_{it}\boldsymbol{\gamma} + \delta_1 prog_{it} + c_i + u_{it}, \tag{10.16}$$


where $i$ indexes individual and $t$ indexes time period. The parameter $\theta_t$ denotes a
time-varying intercept, and $\mathbf{z}_{it}$ is a set of observable characteristics that affect wage
and may also be correlated with program participation.

Evaluation data sets are often collected at two points in time. At $t = 1$, no one
has participated in the program, so that $prog_{i1} = 0$ for all $i$. Then, a subgroup is
chosen to participate in the program (or the individuals choose to participate), and
subsequent wages are observed for the control and treatment groups in $t = 2$. Model
(10.16) allows for any number of time periods and general patterns of program
participation.

The reason for including the individual effect, $c_i$, is the usual omitted ability story:
if individuals choose whether or not to participate in the program, that choice could
be correlated with ability. This possibility is often called the self-selection problem.
Alternatively, administrators might assign people based on characteristics that the
econometrician cannot observe.

The other issue is the strict exogeneity assumption of the explanatory variables,
particularly $prog_{it}$. Typically, we feel comfortable with assuming that $u_{it}$ is
uncorrelated with $prog_{it}$. But what about correlation between $u_{it}$ and, say, $prog_{i,t+1}$? Future
program participation could depend on $u_{it}$ if people choose to participate in the
future based on shocks to their wage in the past, or if administrators choose people as
participants at time $t + 1$ who had a low $u_{it}$. Such feedback might not be very
important, since $c_i$ is being allowed for, but it could be. See, for example, Bassi (1984)
and Ham and LaLonde (1996). Another issue, which is more easily dealt with, is that
the training program could have lasting effects. If so, then we should include lags of
$prog_{it}$ in model (10.16). Or, the program itself might last more than one period, in
which case $prog_{it}$ can be replaced by a series of dummy variables for how long unit $i$
at time $t$ has been subject to the program.


Example 10.2 (Distributed Lag Model): Hausman, Hall, and Griliches (1984)
estimate nonlinear distributed lag models to study the relationship between patents
awarded to a firm and current and past levels of R&D spending. A linear, five-lag
version of their model is


$$patents_{it} = \theta_t + \mathbf{z}_{it}\boldsymbol{\gamma} + \delta_0 RD_{it} + \delta_1 RD_{i,t-1} + \cdots + \delta_5 RD_{i,t-5} + c_i + u_{it}, \tag{10.17}$$


where $RD_{it}$ is spending on R&D for firm $i$ at time $t$ and $\mathbf{z}_{it}$ contains variables such as
firm size (as measured by sales or employees). The variable $c_i$ is a firm heterogeneity
term that may influence $patents_{it}$ and that may be correlated with current, past, and
future R&D expenditures. Interest lies in the pattern of the $\delta_j$ coefficients. As with the
other examples, we must decide whether R&D spending is likely to be correlated with
$c_i$. In addition, if shocks to patents today (changes in $u_{it}$) influence R&D spending at
future dates, then strict exogeneity can fail, and the methods in this chapter will not
apply.


The next example presents a case where the strict exogeneity assumption is neces- 
sarily false, and the unobserved effect and the explanatory variable must be correlated. 


Example 10.3 (Lagged Dependent Variable): A simple dynamic model of wage
determination with unobserved heterogeneity is


$$\log(wage_{it}) = \beta_1 \log(wage_{i,t-1}) + c_i + u_{it}, \qquad t = 1, 2, \ldots, T. \tag{10.18}$$


Often, interest lies in how persistent wages are (as measured by the size of $\beta_1$) after
controlling for unobserved heterogeneity (individual productivity), $c_i$. Letting $y_{it} =
\log(wage_{it})$, a standard assumption would be


$$E(u_{it} \mid y_{i,t-1}, \ldots, y_{i0}, c_i) = 0, \tag{10.19}$$


which means that all of the dynamics are captured by the first lag. Let $x_{it} = y_{i,t-1}$.
Then, under assumption (10.19), $u_{it}$ is uncorrelated with $(x_{it}, x_{i,t-1}, \ldots, x_{i1})$, but $u_{it}$
cannot be uncorrelated with $(x_{i,t+1}, \ldots, x_{iT})$, as $x_{i,t+1} = y_{it}$. In fact,


$$E(y_{it}u_{it}) = \beta_1 E(y_{i,t-1}u_{it}) + E(c_i u_{it}) + E(u_{it}^2) = E(u_{it}^2) > 0, \tag{10.20}$$


because $E(y_{i,t-1}u_{it}) = 0$ and $E(c_i u_{it}) = 0$ under assumption (10.19). Therefore, the
strict exogeneity assumption never holds in unobserved effects models with lagged
dependent variables.

In addition, $y_{i,t-1}$ and $c_i$ are necessarily correlated (since at time $t - 1$, $y_{i,t-1}$ is the
left-hand-side variable). Not only must strict exogeneity fail in this model, but the
exogeneity assumption required for pooled OLS estimation of model (10.18) is also
violated. We will study estimation of such models in Chapter 11.


10.3 Estimating Unobserved Effects Models by Pooled Ordinary Least Squares


Under certain assumptions, the pooled OLS estimator can be used to obtain a
consistent estimator of $\boldsymbol{\beta}$ in model (10.11). Write the model as


$$y_{it} = \mathbf{x}_{it}\boldsymbol{\beta} + v_{it}, \qquad t = 1, 2, \ldots, T, \tag{10.21}$$


where $v_{it} \equiv c_i + u_{it}$, $t = 1, \ldots, T$ are the composite errors. For each $t$, $v_{it}$ is the sum of
the unobserved effect and an idiosyncratic error. From Section 7.8, we know that
pooled OLS estimation of this equation is consistent if $E(\mathbf{x}_{it}' v_{it}) = \mathbf{0}$, $t = 1, 2, \ldots, T$.
Practically speaking, no correlation between $\mathbf{x}_{it}$ and $v_{it}$ means that we are assuming
$E(\mathbf{x}_{it}' u_{it}) = \mathbf{0}$ and


$$E(\mathbf{x}_{it}' c_i) = \mathbf{0}, \qquad t = 1, 2, \ldots, T. \tag{10.22}$$


Equation (10.22) is the more restrictive assumption, since $E(\mathbf{x}_{it}' u_{it}) = \mathbf{0}$ holds if we
have successfully modeled $E(y_{it} \mid \mathbf{x}_{it}, c_i)$.

In static and finite distributed lag models we are sometimes willing to make the
assumption (10.22); in fact, we will do so in the next section on random effects
estimation. As seen in Example 10.3, models with lagged dependent variables in $\mathbf{x}_{it}$ must
violate assumption (10.22) because $y_{i,t-1}$ and $c_i$ must be correlated.
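A short simulation can make the failure of (10.22) with a lagged dependent variable concrete. This is a sketch under assumed settings that do not come from the text: $\beta_1 = 0.5$, standard normal $c_i$ and $u_{it}$, and an initial condition drawn from the stationary distribution given $c_i$. The function name `simulate_dynamic_panel` is hypothetical.

```python
import numpy as np


def simulate_dynamic_panel(n=20_000, T=5, beta1=0.5, seed=0):
    """Simulate y_it = beta1 * y_{i,t-1} + c_i + u_it and apply pooled OLS.

    Returns the pooled OLS slope on y_{i,t-1} and the sample analogue
    of E(y_it * u_it), the quantity in equation (10.20).
    """
    rng = np.random.default_rng(seed)
    c = rng.normal(size=n)
    u = rng.normal(size=(n, T))
    y = np.empty((n, T + 1))
    # Initial condition: stationary distribution of y given c_i (an assumption).
    y[:, 0] = c / (1 - beta1) + rng.normal(size=n) / np.sqrt(1 - beta1**2)
    for t in range(1, T + 1):
        y[:, t] = beta1 * y[:, t - 1] + c + u[:, t - 1]

    y_lag = y[:, :-1].ravel()                  # y_{i,t-1}, stacked by unit then time
    y_cur = y[:, 1:].ravel()                   # y_it, same ordering
    X = np.column_stack([np.ones(n * T), y_lag])
    slope = np.linalg.lstsq(X, y_cur, rcond=None)[0][1]
    feedback = np.mean(y[:, 1:] * u)           # estimates E(y_it u_it) = E(u_it^2)
    return slope, feedback


if __name__ == "__main__":
    slope, feedback = simulate_dynamic_panel()
    print(f"pooled OLS slope (true beta1 = 0.5): {slope:.3f}")
    print(f"sample mean of y_it * u_it:          {feedback:.3f}")
```

In this design the pooled OLS slope settles well above 0.5 (the probability limit is about 0.875 under the stationarity assumed here) because $y_{i,t-1}$ is positively correlated with $c_i$, and the sample mean of $y_{it}u_{it}$ is close to $E(u_{it}^2) = 1$, in line with (10.20).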

Even if assumption (10.22) holds, the composite errors will be serially correlated
due to the presence of $c_i$ in each time period. Therefore, inference using pooled OLS
requires the robust variance matrix estimator and robust test statistics from Chapter
7. Because $v_{it}$ depends on $c_i$ for all $t$, the correlation between $v_{it}$ and $v_{is}$ does not
generally decrease as the distance $|t - s|$ increases; in time series parlance, the $v_{it}$ are
not weakly dependent across time. (We show this property explicitly in the next
section when $\{u_{it}: t = 1, \ldots, T\}$ is homoskedastic and serially uncorrelated.) Therefore,
it is important that we be able to rely on large-$N$ and fixed-$T$ asymptotics when
applying pooled OLS.

As we discussed in Chapter 7, each $(\mathbf{y}_i, \mathbf{X}_i)$ has $T$ rows and should be ordered
chronologically, and the $(\mathbf{y}_i, \mathbf{X}_i)$ should be stacked from $i = 1, \ldots, N$. The order of
the cross section observations is, as usual, irrelevant.
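The stacking convention and the need for robust inference can be sketched in code. The following is a minimal illustration, not the book's own implementation: it assumes a balanced panel stored as arrays `y` of shape `(N, T)` and `X` of shape `(N, T, K)`, and it uses a standard sandwich variance estimator that allows arbitrary serial correlation within each unit $i$. The function name `pooled_ols` is hypothetical.

```python
import numpy as np


def pooled_ols(y, X):
    """Pooled OLS with a variance matrix robust to within-unit serial correlation.

    y : (N, T) array, X : (N, T, K) array, rows ordered chronologically within unit.
    Returns (beta_hat, robust_variance, residuals).
    """
    N, T, K = X.shape
    Xs = X.reshape(N * T, K)                   # stack (y_i, X_i) from i = 1, ..., N
    ys = y.reshape(N * T)
    XtX = Xs.T @ Xs
    beta = np.linalg.solve(XtX, Xs.T @ ys)
    resid = y - X @ beta                       # (N, T) composite-error residuals

    scores = np.einsum("itk,it->ik", X, resid) # row i holds X_i' v_i
    meat = scores.T @ scores                   # sum_i X_i' v_i v_i' X_i
    bread = np.linalg.inv(XtX)
    V_robust = bread @ meat @ bread
    return beta, V_robust, resid
```

Because the middle matrix sums the outer products of $\mathbf{X}_i'\hat{\mathbf{v}}_i$ over units rather than over individual observations, it does not impose zero correlation between $v_{it}$ and $v_{is}$, which is what the presence of $c_i$ rules out.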


10.4 Random Effects Methods 


10.4.1 Estimation and Inference under the Basic Random Effects Assumptions 


As with pooled OLS, a random effects analysis puts $c_i$ into the error term. In fact,
random effects analysis imposes more assumptions than those needed for pooled
OLS: strict exogeneity in addition to orthogonality between $c_i$ and $\mathbf{x}_{it}$. Stating the
assumption in terms of conditional means, we have


ASSUMPTION RE.1:
(a) $E(u_{it} \mid \mathbf{x}_i, c_i) = 0$, $t = 1, \ldots, T$; (b) $E(c_i \mid \mathbf{x}_i) = E(c_i) = 0$


where $\mathbf{x}_i \equiv (\mathbf{x}_{i1}, \mathbf{x}_{i2}, \ldots, \mathbf{x}_{iT})$.


In Section 10.2 we discussed the meaning of the strict exogeneity Assumption
RE.1a. Assumption RE.1b is how we will state the orthogonality between $c_i$ and each
$\mathbf{x}_{it}$. For obtaining consistency results, we could relax RE.1b to assumption (10.22),
but in practice (10.22) affords little more generality, and we will use Assumption
RE.1b later to derive the traditional asymptotic variance for the random effects
estimator. Assumption RE.1b is always implied by the assumption that the $\mathbf{x}_{it}$ are
nonrandom and $E(c_i) = 0$, or by the assumption that $c_i$ is independent of $\mathbf{x}_i$. The
important part is $E(c_i \mid \mathbf{x}_i) = E(c_i)$; the assumption $E(c_i) = 0$ is without loss of
generality, provided an intercept is included in $\mathbf{x}_{it}$, as should always be the case.

Why do we maintain Assumption RE.1 when it is more restrictive than needed for
a pooled OLS analysis? The random effects approach exploits the serial correlation in
the composite error, $v_{it} = c_i + u_{it}$, in a generalized least squares (GLS) framework. In
order to ensure that feasible GLS is consistent, we need some form of strict
exogeneity between the explanatory variables and the composite error. Under Assumption
RE.1 we can write


$$y_{it} = \mathbf{x}_{it}\boldsymbol{\beta} + v_{it}, \tag{10.23}$$

$$E(v_{it} \mid \mathbf{x}_i) = 0, \qquad t = 1, 2, \ldots, T, \tag{10.24}$$

where

$$v_{it} = c_i + u_{it}. \tag{10.25}$$


Equation (10.24) shows that $\{\mathbf{x}_{it}: t = 1, \ldots, T\}$ satisfies the strict exogeneity
assumption SGLS.1 (see Chapter 7) in the model (10.23). Therefore, we can apply GLS
methods that account for the particular error structure in equation (10.25).

Write the model (10.23) for all $T$ time periods as


$$\mathbf{y}_i = \mathbf{X}_i\boldsymbol{\beta} + \mathbf{v}_i, \tag{10.26}$$


and $\mathbf{v}_i$ can be written as $\mathbf{v}_i = c_i\mathbf{j}_T + \mathbf{u}_i$, where $\mathbf{j}_T$ is the $T \times 1$ vector of ones. Define the
(unconditional) variance matrix of $\mathbf{v}_i$ as


$$\boldsymbol{\Omega} \equiv E(\mathbf{v}_i\mathbf{v}_i'), \tag{10.27}$$


a $T \times T$ matrix that we assume to be positive definite. Remember, this matrix is
necessarily the same for all $i$ because of the random sampling assumption in the cross
section.

For consistency of GLS, we need the usual rank condition for GLS:


ASSUMPTION RE.2: rank $E(\mathbf{X}_i'\boldsymbol{\Omega}^{-1}\mathbf{X}_i) = K$.


Applying the results from Chapter 7, we know that GLS and feasible GLS are
consistent under Assumptions RE.1 and RE.2. A general FGLS analysis, using an
unrestricted variance estimator $\hat{\boldsymbol{\Omega}}$, is consistent and $\sqrt{N}$-asymptotically normal as
$N \to \infty$. But we would not be exploiting the unobserved effects structure of $v_{it}$. A
standard random effects analysis adds assumptions on the idiosyncratic errors that
give $\boldsymbol{\Omega}$ a special form. The first assumption is that the idiosyncratic errors $u_{it}$ have a
constant unconditional variance across $t$:

$$E(u_{it}^2) = \sigma_u^2, \qquad t = 1, 2, \ldots, T. \tag{10.28}$$

The second assumption is that the idiosyncratic errors are serially uncorrelated:

$$E(u_{it}u_{is}) = 0, \qquad \text{all } t \neq s. \tag{10.29}$$


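Under these two assumptions, we can derive the variances and covariances of the
elements of $\mathbf{v}_i$. Under Assumption RE.1a, $E(c_i u_{it}) = 0$, $t = 1, 2, \ldots, T$, and so

$$E(v_{it}^2) = E(c_i^2) + 2E(c_i u_{it}) + E(u_{it}^2) = \sigma_c^2 + \sigma_u^2,$$

where $\sigma_c^2 = E(c_i^2)$. Also, for all $t \neq s$,

$$E(v_{it}v_{is}) = E[(c_i + u_{it})(c_i + u_{is})] = E(c_i^2) = \sigma_c^2.$$

Therefore, under assumptions RE.1, (10.28), and (10.29), $\boldsymbol{\Omega}$ takes the special form

$$\boldsymbol{\Omega} = E(\mathbf{v}_i\mathbf{v}_i') = \begin{pmatrix} \sigma_c^2 + \sigma_u^2 & \sigma_c^2 & \cdots & \sigma_c^2 \\ \sigma_c^2 & \sigma_c^2 + \sigma_u^2 & \cdots & \vdots \\ \vdots & & \ddots & \sigma_c^2 \\ \sigma_c^2 & \cdots & \sigma_c^2 & \sigma_c^2 + \sigma_u^2 \end{pmatrix}. \tag{10.30}$$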

Because $\mathbf{j}_T\mathbf{j}_T'$ is the $T \times T$ matrix with unity in every element, we can write the matrix
(10.30) as


$$\boldsymbol{\Omega} = \sigma_u^2\mathbf{I}_T + \sigma_c^2\mathbf{j}_T\mathbf{j}_T'. \tag{10.31}$$


When $\boldsymbol{\Omega}$ has the form (10.31), we say it has the random effects structure. Rather
than depending on $T(T + 1)/2$ unrestricted variances and covariances, as would be
the case in a general GLS analysis, $\boldsymbol{\Omega}$ depends only on two parameters, $\sigma_c^2$ and $\sigma_u^2$,
regardless of the size of $T$. The correlation between the composite errors $v_{it}$ and $v_{is}$
does not depend on the difference between $t$ and $s$: $\mathrm{Corr}(v_{is}, v_{it}) = \sigma_c^2/(\sigma_c^2 + \sigma_u^2) \geq 0$,
$s \neq t$. This correlation is also the ratio of the variance of $c_i$ to the variance of the
composite error, and it is useful as a measure of the relative importance of the
unobserved effect $c_i$.

Unlike the stable AR(1) model we briefly discussed in Section 7.8.6, the correlation
between the composite errors $v_{it}$ and $v_{is}$ does not tend to zero as $t$ and $s$ get far apart
under the RE covariance structure. If $v_{it} = \rho v_{i,t-1} + e_{it}$, where $|\rho| < 1$ and the $\{e_{it}\}$
are serially uncorrelated, then $\mathrm{Cov}(v_{it}, v_{is})$ converges to zero pretty quickly as $|t - s|$
gets large. (Of course, the convergence is faster with smaller $|\rho|$.) Unlike standard
models for serial correlation in time series settings, the random effects assumption
implies strong persistence in the unobservables over time, due, of course, to the
presence of $c_i$.
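The contrast between the two correlation patterns is easy to see numerically. The sketch below builds the matrix in (10.31) and the correlation matrix of a stationary AR(1) process; the parameter values in the usage lines and the function names are illustrative assumptions, not taken from the text.

```python
import numpy as np


def re_omega(T, sigma_c2, sigma_u2):
    """Random effects variance matrix: sigma_u^2 * I_T + sigma_c^2 * j_T j_T'."""
    j = np.ones((T, 1))
    return sigma_u2 * np.eye(T) + sigma_c2 * (j @ j.T)


def corr_from_cov(S):
    """Convert a covariance matrix into a correlation matrix."""
    d = np.sqrt(np.diag(S))
    return S / np.outer(d, d)


def ar1_corr(T, rho):
    """Correlation matrix of a stationary AR(1) process: rho^|t - s|."""
    idx = np.arange(T)
    return rho ** np.abs(idx[:, None] - idx[None, :])


if __name__ == "__main__":
    T = 6
    print(np.round(corr_from_cov(re_omega(T, sigma_c2=2.0, sigma_u2=1.0)), 3))
    print(np.round(ar1_corr(T, rho=0.5), 3))
```

Every off-diagonal entry of the RE correlation matrix equals $\sigma_c^2/(\sigma_c^2 + \sigma_u^2)$ no matter how far apart $t$ and $s$ are, whereas the AR(1) entries decay geometrically in $|t - s|$.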

Assumptions (10.28) and (10.29) are special to random effects. For efficiency of
feasible GLS, we assume that the variance matrix of $\mathbf{v}_i$ conditional on $\mathbf{x}_i$ is constant:


$$E(\mathbf{v}_i\mathbf{v}_i' \mid \mathbf{x}_i) = E(\mathbf{v}_i\mathbf{v}_i'). \tag{10.32}$$

Assumptions (10.28), (10.29), and (10.32) are implied by our third RE assumption:


ASSUMPTION RE.3: (a) $E(\mathbf{u}_i\mathbf{u}_i' \mid \mathbf{x}_i, c_i) = \sigma_u^2\mathbf{I}_T$; (b) $E(c_i^2 \mid \mathbf{x}_i) = \sigma_c^2$.


Under Assumption RE.3a, $E(u_{it}^2 \mid \mathbf{x}_i, c_i) = \sigma_u^2$, $t = 1, \ldots, T$, which implies
assumption (10.28), and $E(u_{it}u_{is} \mid \mathbf{x}_i, c_i) = 0$, $t \neq s$, $t, s = 1, \ldots, T$, which implies assumption
(10.29) (both by the usual iterated expectations argument). But Assumption RE.3a is
stronger because it assumes that the conditional variances are constant and the
conditional covariances are zero. Along with Assumption RE.1b, Assumption RE.3b is
the same as $\mathrm{Var}(c_i \mid \mathbf{x}_i) = \mathrm{Var}(c_i)$, which is a homoskedasticity assumption on the
unobserved effect $c_i$. Under Assumption RE.3, assumption (10.32) holds and $\boldsymbol{\Omega}$ has
the form (10.30).

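To implement an FGLS procedure, define $\sigma_v^2 = \sigma_c^2 + \sigma_u^2$. For now, assume that we
have consistent estimators of $\sigma_u^2$ and $\sigma_c^2$. Then we can form


$$\hat{\boldsymbol{\Omega}} \equiv \hat{\sigma}_u^2\mathbf{I}_T + \hat{\sigma}_c^2\mathbf{j}_T\mathbf{j}_T', \tag{10.33}$$


a $T \times T$ matrix that we assume to be positive definite. In a panel data context, the
FGLS estimator that uses the variance matrix (10.33) is what is known as the random
effects estimator:

$$\hat{\boldsymbol{\beta}}_{RE} = \left(\sum_{i=1}^{N} \mathbf{X}_i'\hat{\boldsymbol{\Omega}}^{-1}\mathbf{X}_i\right)^{-1}\left(\sum_{i=1}^{N} \mathbf{X}_i'\hat{\boldsymbol{\Omega}}^{-1}\mathbf{y}_i\right). \tag{10.34}$$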


Before we discuss obtaining $\hat{\boldsymbol{\Omega}}$ and performing asymptotic inference under the full
set of RE assumptions, it is important to know that the RE estimator is generally
consistent with or without Assumption RE.3. Clearly, the form of $\hat{\boldsymbol{\Omega}}$ is motivated by
Assumption RE.3. Nevertheless, as we discussed in Section 7.5.3, incorrectly
imposing restrictions on $E(\mathbf{v}_i\mathbf{v}_i')$ does not cause inconsistency of FGLS provided the
appropriate strict exogeneity assumption holds, as it does under Assumption RE.1. Of
course, we have to insert plim($\hat{\boldsymbol{\Omega}}$) (which necessarily has the RE form) in place of $\boldsymbol{\Omega}$
in Assumption RE.2, but that is a minor change. Practically important is that
without Assumption RE.3, we need to make inference fully robust, but we hold off on
that until Section 10.4.2.

Under Assumption RE.3, the RE estimator is efficient in the class of estimators
consistent under $E(\mathbf{v}_i \mid \mathbf{x}_i) = \mathbf{0}$, including pooled OLS and a variety of weighted least
squares estimators, because RE is asymptotically equivalent to GLS under
Assumptions RE.1-RE.3. The usual feasible GLS variance matrix (see equation (7.54)) is
valid under Assumptions RE.1-RE.3. The only difference from the general analysis
is that $\hat{\boldsymbol{\Omega}}$ is chosen as in expression (10.33).

In order to implement the RE procedure, we need to obtain $\hat{\sigma}_c^2$ and $\hat{\sigma}_u^2$. Actually, it
is easiest to first find $\hat{\sigma}_v^2 = \hat{\sigma}_c^2 + \hat{\sigma}_u^2$. Under Assumption RE.3a, $\sigma_v^2 = T^{-1}\sum_{t=1}^{T} E(v_{it}^2)$
for all $i$; therefore, averaging $v_{it}^2$ across all $i$ and $t$ would give a consistent estimator of
$\sigma_v^2$. But we need to estimate $\boldsymbol{\beta}$ to make this method operational. A convenient initial
estimator of $\boldsymbol{\beta}$ is the pooled OLS estimator, denoted here by $\check{\boldsymbol{\beta}}$. Let $\check{v}_{it}$ denote the
pooled OLS residuals. A consistent estimator of $\sigma_v^2$ is


$$\hat{\sigma}_v^2 = \frac{1}{NT - K}\sum_{i=1}^{N}\sum_{t=1}^{T}\check{v}_{it}^2, \tag{10.35}$$

which is the usual variance estimator from the OLS regression on the pooled data.
The degrees-of-freedom correction in equation (10.35) (that is, the use of $NT - K$
rather than $NT$) has no effect asymptotically. Under Assumptions RE.1-RE.3,
equation (10.35) is a consistent estimator (actually, a $\sqrt{N}$-consistent estimator)
of $\sigma_v^2$.

To find a consistent estimator of $\sigma_c^2$, recall that $\sigma_c^2 = E(v_{it}v_{is})$, all $t \neq s$. Therefore,
for each $i$, there are $T(T - 1)/2$ nonredundant error products that can be used to
estimate $\sigma_c^2$. If we sum all these combinations and take the expectation, we get, for
each $i$,

$$E\left(\sum_{t=1}^{T-1}\sum_{s=t+1}^{T} v_{it}v_{is}\right) = \sum_{t=1}^{T-1}\sum_{s=t+1}^{T}\sigma_c^2 = \sigma_c^2\sum_{t=1}^{T-1}(T - t) = \sigma_c^2\big((T-1) + (T-2) + \cdots + 2 + 1\big) = \sigma_c^2\,T(T-1)/2, \tag{10.36}$$


where we have used the fact that the sum of the first $T - 1$ positive integers is
$T(T - 1)/2$. As usual, a consistent estimator is obtained by replacing the expectation
with an average (across $i$) and replacing $v_{it}$ with its pooled OLS residual. We also
make a degrees-of-freedom adjustment as a small-sample correction:


$$\hat{\sigma}_c^2 = \frac{1}{[NT(T-1)/2 - K]}\sum_{i=1}^{N}\sum_{t=1}^{T-1}\sum_{s=t+1}^{T}\check{v}_{it}\check{v}_{is} \tag{10.37}$$


is a $\sqrt{N}$-consistent estimator of $\sigma_c^2$ under Assumptions RE.1-RE.3. Equation (10.37),
without the degrees-of-freedom adjustment, appears elsewhere; see, for example, Baltagi
(2001, Sect. 2.3). Given $\hat{\sigma}_v^2$ and $\hat{\sigma}_c^2$, we can form $\hat{\sigma}_u^2 = \hat{\sigma}_v^2 - \hat{\sigma}_c^2$. (The idiosyncratic
error variance, $\sigma_u^2$, can also be estimated using the fixed effects method, which we
discuss in Section 10.5. Also, there are other methods of estimating $\sigma_c^2$. A common
estimator of $\sigma_c^2$ is based on the between estimator of $\boldsymbol{\beta}$, which we touch on in Section
10.5; see Hsiao (2003, Sect. 3.3) and Baltagi (2001, Sect. 2.3). Because the RE
estimator is a feasible GLS estimator under strict exogeneity, all that we need are
consistent estimators of $\sigma_c^2$ and $\sigma_u^2$ in order to obtain a $\sqrt{N}$-efficient estimator of $\boldsymbol{\beta}$.)

As a practical matter, equation (10.37) is not guaranteed to be positive, although
it is in the vast majority of applications. A negative value for $\hat{\sigma}_c^2$ is indicative of
negative serial correlation in $u_{it}$, probably a substantial amount, which means that
Assumption RE.3a is violated. Alternatively, some other assumption in the model can
be false. We should make sure that time dummies are included in the model if
aggregate effects are important; omitting them can induce serial correlation in the
implied $u_{it}$. When the intercepts are allowed to change freely over time, the effects of
other aggregate variables will not be identified. If $\hat{\sigma}_c^2$ is negative, unrestricted FGLS
may be called for; see Section 10.4.3.
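The steps in equations (10.33) through (10.37) can be collected into a short routine. This is a minimal sketch rather than a reference implementation: it assumes a balanced panel stored as `y` with shape `(N, T)` and `X` with shape `(N, T, K)` (including an intercept column), the function name `random_effects` is hypothetical, and raising an error when a variance estimate is not positive is a design choice made here for illustration.

```python
import numpy as np


def random_effects(y, X):
    """Random effects (FGLS) estimator for a balanced panel.

    y : (N, T) array, X : (N, T, K) array with an intercept column.
    Returns a dict with beta, the variance component estimates, and the
    usual (nonrobust) FGLS variance matrix valid under RE.1-RE.3.
    """
    N, T, K = X.shape
    Xs, ys = X.reshape(N * T, K), y.reshape(N * T)
    b_pols = np.linalg.lstsq(Xs, ys, rcond=None)[0]
    v = y - X @ b_pols                                   # pooled OLS residuals, (N, T)

    sigma_v2 = (v ** 2).sum() / (N * T - K)              # equation (10.35)
    # sum over t < s of v_it * v_is equals ((sum_t v_it)^2 - sum_t v_it^2) / 2
    cross = (v.sum(axis=1) ** 2 - (v ** 2).sum(axis=1)) / 2
    sigma_c2 = cross.sum() / (N * T * (T - 1) / 2 - K)   # equation (10.37)
    sigma_u2 = sigma_v2 - sigma_c2
    if sigma_c2 <= 0 or sigma_u2 <= 0:
        raise ValueError("nonpositive variance estimate; RE structure is suspect")

    omega = sigma_u2 * np.eye(T) + sigma_c2 * np.ones((T, T))   # equation (10.33)
    omega_inv = np.linalg.inv(omega)
    A = np.einsum("itk,ts,isl->kl", X, omega_inv, X)     # sum_i X_i' Omega^{-1} X_i
    b = np.einsum("itk,ts,is->k", X, omega_inv, y)       # sum_i X_i' Omega^{-1} y_i
    beta = np.linalg.solve(A, b)                         # equation (10.34)
    return {"beta": beta, "sigma_c2": sigma_c2, "sigma_u2": sigma_u2,
            "omega": omega, "V": np.linalg.inv(A)}
```

The cross-product identity avoids an explicit double loop over $t < s$. The returned `V` is the inverse of $\sum_i \mathbf{X}_i'\hat{\boldsymbol{\Omega}}^{-1}\mathbf{X}_i$ and should be used for inference only when Assumption RE.3 is maintained.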


Example 10.4 (RE Estimation of the Effects of Job Training Grants): We now use
the data in JTRAIN1.RAW to estimate the effect of job training grants on firm scrap
rates, using a random effects analysis. There are 54 firms that reported scrap rates for
each of the years 1987, 1988, and 1989. Grants were not awarded in 1987. Some firms
received grants in 1988, others received grants in 1989, and a firm could not receive a
grant twice. Since there are firms in 1989 that received a grant only in 1988, it is
important to allow the grant effect to persist one period. The estimated equation is


$$\widehat{\log(scrap)} = \underset{(.243)}{.415} - \underset{(.109)}{.093}\,d88 - \underset{(.132)}{.270}\,d89 + \underset{(.411)}{.548}\,union - \underset{(.148)}{.215}\,grant - \underset{(.205)}{.377}\,grant_{-1}.$$


The lagged value of grant has the larger impact and is statistically significant at the 5
percent level against a one-sided alternative. You are invited to estimate the equation
without $grant_{-1}$ to verify that the estimated grant effect is much smaller (on the order
of 6.7 percent) and statistically insignificant.


Multiple hypothesis tests are carried out as in any FGLS analysis; see Section 7.6,
where $G = T$. In computing an F-type statistic based on weighted sums of squared
residuals, $\hat{\boldsymbol{\Omega}}$ in expression (10.33) should be based on the pooled OLS residuals from
the unrestricted model. Then, obtain the residuals from the unrestricted random
effects estimation as $\hat{\mathbf{v}}_i \equiv \mathbf{y}_i - \mathbf{X}_i\hat{\boldsymbol{\beta}}_{RE}$. Let $\tilde{\boldsymbol{\beta}}_{RE}$ denote the RE estimator with the $Q$
linear restrictions imposed, and define the restricted RE residuals as $\tilde{\mathbf{v}}_i \equiv \mathbf{y}_i - \mathbf{X}_i\tilde{\boldsymbol{\beta}}_{RE}$.
Insert these into equation (7.56) in place of $\hat{\mathbf{u}}_i$ and $\tilde{\mathbf{u}}_i$ for a chi-square statistic or into
equation (7.57) for an F-type statistic.

In Example 10.4, the Wald test for joint significance of $grant$ and $grant_{-1}$ (against a
two-sided alternative) yields a $\chi_2^2$ statistic equal to 3.66, with p-value = .16. (This test
comes from Stata.)


10.4.2 Robust Variance Matrix Estimator 


Because failure of Assumption RE.3 does not cause inconsistency in the RE
estimator, it is very useful to be able to conduct statistical inference without this
assumption. Assumption RE.3 can fail for two reasons. First, $E(\mathbf{v}_i\mathbf{v}_i' \mid \mathbf{x}_i)$ may not be
constant, so that $E(\mathbf{v}_i\mathbf{v}_i' \mid \mathbf{x}_i) \neq E(\mathbf{v}_i\mathbf{v}_i')$. This outcome is always a possibility with GLS
analysis. Second, $E(\mathbf{v}_i\mathbf{v}_i')$ may not have the RE structure: the idiosyncratic errors $u_{it}$
may have variances that change over time, or they could be serially correlated. In
either case a robust variance matrix is available from the analysis in Chapter 7. We
simply use equation (7.52) with $\hat{\mathbf{u}}_i$ replaced by $\hat{\mathbf{v}}_i = \mathbf{y}_i - \mathbf{X}_i\hat{\boldsymbol{\beta}}_{RE}$, $i = 1, 2, \ldots, N$, the
$T \times 1$ vectors of RE residuals.

Robust standard errors are obtained in the usual way from the robust variance
matrix estimator, and robust Wald statistics are obtained by the usual formula $W =
(\mathbf{R}\hat{\boldsymbol{\beta}} - \mathbf{r})'(\mathbf{R}\hat{\mathbf{V}}\mathbf{R}')^{-1}(\mathbf{R}\hat{\boldsymbol{\beta}} - \mathbf{r})$, where $\hat{\mathbf{V}}$ is the robust variance matrix estimator.
Remember, if Assumption RE.3 is violated, the sum of squared residuals form of the F
statistic is not valid.

The idea behind using a robust variance matrix is the following. Assumptions
RE.1-RE.3 lead to a well-known estimation technique whose properties are
understood under these assumptions. But it is always a good idea to make the analysis
robust whenever feasible. With fixed $T$ and large $N$ asymptotics, we lose nothing in
using the robust standard errors and test statistics even if Assumption RE.3 holds. In
Section 10.7.2, we show how the RE estimator can be obtained from a particular
pooled OLS regression, which makes obtaining robust standard errors and $t$ and $F$
statistics especially easy.
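A robust variance matrix and the Wald statistic from this subsection can be sketched as follows. The code assumes that equation (7.52) has the usual FGLS sandwich form, with $\sum_i \mathbf{X}_i'\hat{\boldsymbol{\Omega}}^{-1}\mathbf{X}_i$ on the outside and $\sum_i \mathbf{X}_i'\hat{\boldsymbol{\Omega}}^{-1}\hat{\mathbf{v}}_i\hat{\mathbf{v}}_i'\hat{\boldsymbol{\Omega}}^{-1}\mathbf{X}_i$ in the middle; the function names and the array layout (`X` of shape `(N, T, K)`, residuals of shape `(N, T)`) are assumptions made for illustration.

```python
import numpy as np


def robust_re_variance(X, resid, omega_hat):
    """Sandwich variance estimator for FGLS with working variance omega_hat.

    X : (N, T, K) regressors, resid : (N, T) RE residuals, omega_hat : (T, T).
    """
    omega_inv = np.linalg.inv(omega_hat)
    # XO[i] holds Omega^{-1} X_i (omega_inv is symmetric), shape (T, K)
    XO = np.einsum("itk,ts->isk", X, omega_inv)
    A = np.einsum("isk,isl->kl", XO, X)            # sum_i X_i' Omega^{-1} X_i
    scores = np.einsum("isk,is->ik", XO, resid)    # row i: X_i' Omega^{-1} v_i
    B = scores.T @ scores                          # sum of outer products of the scores
    A_inv = np.linalg.inv(A)
    return A_inv @ B @ A_inv


def wald_stat(beta_hat, V, R, r):
    """W = (R b - r)' (R V R')^{-1} (R b - r)."""
    d = R @ beta_hat - r
    return float(d @ np.linalg.solve(R @ V @ R.T, d))
```

Under the null hypothesis the statistic is compared with a chi-square distribution whose degrees of freedom equal the number of rows of `R`. Nothing in `robust_re_variance` requires `omega_hat` to be correctly specified, which is the point of the robust approach.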


10.4.3 General Feasible Generalized Least Squares Analysis 


If the idiosyncratic errors $\{u_{it}: t = 1, 2, \ldots, T\}$ are generally heteroskedastic and
serially correlated across $t$, a more general estimator of $\boldsymbol{\Omega}$ can be used in FGLS:


$$\hat{\boldsymbol{\Omega}} = N^{-1}\sum_{i=1}^{N}\check{\mathbf{v}}_i\check{\mathbf{v}}_i', \tag{10.38}$$


where the $\check{\mathbf{v}}_i$ would be the pooled OLS residuals. The FGLS estimator is consistent
under Assumptions RE.1 and RE.2, and, if we assume that $E(\mathbf{v}_i\mathbf{v}_i' \mid \mathbf{x}_i) = \boldsymbol{\Omega}$, then the
FGLS estimator is asymptotically efficient and its asymptotic variance estimator
takes the usual form.

Using equation (10.38) is more general than the RE analysis. In fact, with large $N$
asymptotics, the general FGLS estimator is just as efficient as the RE estimator under
Assumptions RE.1-RE.3. Using equation (10.38) is asymptotically more efficient if
$E(\mathbf{v}_i\mathbf{v}_i' \mid \mathbf{x}_i) = \boldsymbol{\Omega}$, but $\boldsymbol{\Omega}$ does not have the RE form. So why not always use FGLS
with $\hat{\boldsymbol{\Omega}}$ given in equation (10.38)? There are historical reasons for using RE methods
rather than a general FGLS analysis. The structure of $\boldsymbol{\Omega}$ in the matrix (10.30) was
once synonymous with unobserved effects models: any correlation in the composite
errors $\{v_{it}: t = 1, 2, \ldots, T\}$ was assumed to be caused by the presence of $c_i$. The
idiosyncratic errors, $u_{it}$, were, by definition, taken to be serially uncorrelated and
homoskedastic.

If $N$ is not several times larger than $T$, an unrestricted FGLS analysis can have
poor finite sample properties because $\hat{\boldsymbol{\Omega}}$ has $T(T + 1)/2$ estimated elements. Even
though estimation of $\boldsymbol{\Omega}$ does not affect the asymptotic distribution of the FGLS
estimator, it certainly affects its finite sample properties. Random effects estimation
requires estimation of only two variance parameters for any $T$.
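For comparison with the RE estimator, unrestricted FGLS based on equation (10.38) takes only a few lines. This is a sketch under the same assumed array layout as before (`y` of shape `(N, T)`, `X` of shape `(N, T, K)`), and the function name `unrestricted_fgls` is hypothetical.

```python
import numpy as np


def unrestricted_fgls(y, X):
    """FGLS with an unrestricted T x T variance matrix estimated as in (10.38).

    Returns (beta_hat, omega_hat, usual_variance), where usual_variance is
    appropriate when E(v_i v_i' | x_i) = Omega.
    """
    N, T, K = X.shape
    b_pols = np.linalg.lstsq(X.reshape(N * T, K), y.reshape(N * T), rcond=None)[0]
    v = y - X @ b_pols                         # pooled OLS residuals, (N, T)
    omega_hat = v.T @ v / N                    # N^{-1} sum_i v_i v_i', T(T+1)/2 free elements
    omega_inv = np.linalg.inv(omega_hat)
    A = np.einsum("itk,ts,isl->kl", X, omega_inv, X)
    b = np.einsum("itk,ts,is->k", X, omega_inv, y)
    beta = np.linalg.solve(A, b)
    return beta, omega_hat, np.linalg.inv(A)
```

With $T = 10$, for example, `omega_hat` contains 55 distinct estimated elements against two for random effects, which is why the finite sample concern in the text becomes relevant when $N$ is not large relative to $T$.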

With very large $N$, using the general estimate of $\boldsymbol{\Omega}$ is an attractive alternative,
especially if the estimate in equation (10.38) appears to have a pattern different from
the RE pattern. As a middle ground between a traditional random effects analysis
and a full-blown FGLS analysis, we might specify a particular structure for the
idiosyncratic error variance matrix $E(\mathbf{u}_i\mathbf{u}_i')$. For example, if $\{u_{it}\}$ follows a stable
first-order autoregressive process with autocorrelation coefficient $\rho$ and variance $\sigma_u^2$, then
$\boldsymbol{\Omega} = E(\mathbf{u}_i\mathbf{u}_i') + \sigma_c^2\mathbf{j}_T\mathbf{j}_T'$ depends in a known way on only three parameters, $\sigma_u^2$, $\sigma_c^2$, and
$\rho$. These parameters can be estimated after initial pooled OLS estimation, and then
an FGLS procedure using the particular structure of $\boldsymbol{\Omega}$ is easy to implement. We do
not cover such possibilities explicitly; see, for example, MaCurdy (1982). Some are
preprogrammed in statistical packages, but not necessarily with the option of making
inference robust to misspecification of $\boldsymbol{\Omega}$ or system heteroskedasticity.


10.4.4 Testing for the Presence of an Unobserved Effect 


If the standard random effects assumptions RE.1-RE.3 hold but the model does not
actually contain an unobserved effect, pooled OLS is efficient and all associated
pooled OLS statistics are asymptotically valid. The absence of an unobserved effect is
statistically equivalent to $H_0: \sigma_c^2 = 0$.

To test $H_0: \sigma_c^2 = 0$, we can use the simple test for AR(1) serial correlation covered
in Chapter 7 (see equation (7.81)). The AR(1) test is valid because the errors $v_{it}$ are
serially uncorrelated under the null $H_0: \sigma_c^2 = 0$ (and we are assuming that $\{\mathbf{x}_{it}\}$ is
strictly exogenous). However, a better test is based directly on the estimator of $\sigma_c^2$ in
equation (10.37).

Breusch and Pagan (1980) derive a statistic using the Lagrange multiplier principle
in a likelihood setting (something we cover in Chapter 13). We will not derive the
Breusch and Pagan statistic because we are not assuming any particular distribution
for the $v_{it}$. Instead, we derive a similar test that has the advantage of being valid for
any distribution of $\mathbf{v}_i$ and only states that the $v_{it}$ are uncorrelated under the null. (In
particular, the statistic is valid for heteroskedasticity in the $v_{it}$.)

From equation (10.37), we base a test of $H_0: \sigma_c^2 = 0$ on the null asymptotic
distribution of


$$N^{-1/2}\sum_{i=1}^{N}\sum_{t=1}^{T-1}\sum_{s=t+1}^{T}\check{v}_{it}\check{v}_{is}, \tag{10.39}$$


which is essentially the estimator $\hat{\sigma}_c^2$ scaled up by $\sqrt{N}$. Because of strict exogeneity,
this statistic has the same limiting distribution (as $N \to \infty$ with fixed $T$) when
we replace the pooled OLS residuals $\check{v}_{it}$ with the errors $v_{it}$ (see Problem 7.4). For
any distribution of the $v_{it}$, $N^{-1/2}\sum_{i=1}^{N}\sum_{t=1}^{T-1}\sum_{s=t+1}^{T} v_{it}v_{is}$ has a limiting normal
distribution (under the null that the $v_{it}$ are serially uncorrelated) with variance
$E\left(\sum_{t=1}^{T-1}\sum_{s=t+1}^{T} v_{it}v_{is}\right)^2$. We can estimate this variance in the usual way (take away
the expectation, average across $i$, and replace $v_{it}$ with $\check{v}_{it}$). When we put expression
(10.39) over its asymptotic standard error we get the statistic


$$\frac{\sum_{i=1}^{N}\sum_{t=1}^{T-1}\sum_{s=t+1}^{T}\check{v}_{it}\check{v}_{is}}{\left[\sum_{i=1}^{N}\left(\sum_{t=1}^{T-1}\sum_{s=t+1}^{T}\check{v}_{it}\check{v}_{is}\right)^2\right]^{1/2}}. \tag{10.40}$$


Under the null hypothesis that the $v_{it}$ are serially uncorrelated, this statistic is
distributed asymptotically as standard normal. Unlike the Breusch-Pagan statistic, with
expression (10.40) we can reject $H_0$ for negative estimates of $\sigma_c^2$, although negative
estimates are rare in practice (unless we have already differenced the data, something
we discuss in Section 10.6).

The statistic in expression (10.40) can detect many kinds of serial correlation in the
composite error $v_{it}$, and so a rejection of the null should not be interpreted as
implying that the RE error structure must be true. Finding that the $v_{it}$ are serially
correlated is not very surprising in applications, especially since $\mathbf{x}_{it}$ cannot contain
lagged dependent variables for the methods in this chapter.

It is probably more interesting to test for serial correlation in the $\{u_{it}\}$, as this is a
test of the RE form of $\boldsymbol{\Omega}$. Baltagi and Li (1995) obtain a test under normality of $c_i$
and $\{u_{it}\}$, based on the Lagrange multiplier principle. In Section 10.7.2, we discuss a
simpler test for serial correlation in $\{u_{it}\}$ using a pooled OLS regression on
transformed data, which does not rely on normality.
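The statistic in expression (10.40) is simple to compute from a matrix of pooled OLS residuals. The sketch below assumes a balanced panel with residuals stored in an `(N, T)` array; the function name `unobserved_effect_test` is hypothetical.

```python
import numpy as np


def unobserved_effect_test(resid):
    """Statistic (10.40) for H0: sigma_c^2 = 0, from (N, T) pooled OLS residuals.

    For each unit i, the sum over t < s of v_it * v_is is computed through the
    identity ((sum_t v_it)^2 - sum_t v_it^2) / 2.
    """
    resid = np.asarray(resid, dtype=float)
    cross = (resid.sum(axis=1) ** 2 - (resid ** 2).sum(axis=1)) / 2
    return cross.sum() / np.sqrt((cross ** 2).sum())
```

The value is compared with standard normal critical values. Because the numerator can be negative, a two-sided test can also reject in the direction of negative serial correlation in the composite errors.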


10.5 Fixed Effects Methods 


10.5.1 Consistency of the Fixed Effects Estimator 
Again consider the linear unobserved effects model for T time periods: 
Vie = Xuß + Ci + uit, (ea beer! i (10.41) 


The RE approach to estimating f effectively puts c; into the error term, under the 
assumption that c; is orthogonal to x;;, and then accounts for the implied serial cor- 
relation in the composite error vj, = C; + uz, using a GLS analysis. In many applica- 
tions the whole point of using panel data is to allow for c; to be arbitrarily correlated 
with the x;,. A fixed effects analysis achieves this purpose explicitly. 

The T equations in the model (10.41) can be written as 


$$\mathbf{y}_i = \mathbf{X}_i\boldsymbol{\beta} + c_i\mathbf{j}_T + \mathbf{u}_i, \tag{10.42}$$

where $\mathbf{j}_T$ is still the $T \times 1$ vector of ones. As usual, equation (10.42) represents a
single random draw from the cross section.

The first fixed effects (FE) assumption is strict exogeneity of the explanatory variables conditional
on $c_i$:


ASSUMPTION FE.1: $E(u_{it} \mid \mathbf{x}_i, c_i) = 0$, $t = 1, 2, \ldots, T$.


This assumption is identical to the first part of Assumption RE.1. Thus, we maintain strict exogeneity
of $\{\mathbf{x}_{it}: t = 1, \ldots, T\}$ conditional on the unobserved effect. The key difference is
that we do not assume RE.1b. In other words, for FE analysis, $E(c_i \mid \mathbf{x}_i)$ is allowed to
be any function of $\mathbf{x}_i$.

By relaxing RE.1b we can consistently estimate partial effects in the presence of time-constant omitted
variables that can be arbitrarily related to the observables $\mathbf{x}_{it}$. Therefore, FE analysis
is more robust than random effects analysis. As we suggested in Section 10.1, this robustness comes at
a price: without further assumptions, we cannot include time-constant factors in $\mathbf{x}_{it}$. The
reason is simple: if $c_i$ can be arbitrarily correlated with each element of $\mathbf{x}_{it}$, there
is no way to distinguish the effects of time-constant observables from the time-constant unobservable
$c_i$. When analyzing individuals, factors such as gender or race cannot be included in
$\mathbf{x}_{it}$. For analyzing firms, industry cannot be included in $\mathbf{x}_{it}$ unless industry
designation changes over time for at least some firms. For cities, variables describing fixed city
attributes, such as whether or not the city is near a river, cannot be included in $\mathbf{x}_{it}$.

The fact that $\mathbf{x}_{it}$ cannot include time-constant explanatory variables is a drawback in
certain applications, but when the interest is only in time-varying explanatory variables, it is
convenient not to have to worry about modeling time-constant factors that are not of direct interest.

In panel data analysis the term "time-varying explanatory variables" means that each element of
$\mathbf{x}_{it}$ varies over time for some cross section units. Often there are elements of
$\mathbf{x}_{it}$ that are constant across time for a subset of the cross section. For example, if we
have a panel of adults and one element of $\mathbf{x}_{it}$ is education, we can allow education to be
constant for some part of the sample. But we must have education changing for some people in the sample.

As a general specification, let $d2_t, \ldots, dT_t$ denote time period dummies so that $ds_t = 1$ if
$s = t$, and zero otherwise (often these are defined in terms of specific years, such as $d88_t$, but at
this level we call them time period dummies). Let $\mathbf{z}_i$ be a vector of time-constant
observables, and let $\mathbf{w}_{it}$ be a vector of time-varying variables. Suppose $y_{it}$ is
determined by

$$y_{it} = \theta_1 + \theta_2\,d2_t + \cdots + \theta_T\,dT_t + \mathbf{z}_i\boldsymbol{\gamma}_1 + d2_t\,\mathbf{z}_i\boldsymbol{\gamma}_2 + \cdots + dT_t\,\mathbf{z}_i\boldsymbol{\gamma}_T + \mathbf{w}_{it}\boldsymbol{\delta} + c_i + u_{it}, \tag{10.43}$$

$$E(u_{it} \mid \mathbf{z}_i, \mathbf{w}_{i1}, \mathbf{w}_{i2}, \ldots, \mathbf{w}_{iT}, c_i) = 0, \qquad t = 1, 2, \ldots, T. \tag{10.44}$$

We hope that this model represents a causal relationship, where the conditioning on $c_i$ allows us to
control for unobserved factors that are time constant. Without further assumptions, the intercept
$\theta_1$ cannot be identified and the vector $\boldsymbol{\gamma}_1$ on $\mathbf{z}_i$ cannot be
identified, because $\theta_1 + \mathbf{z}_i\boldsymbol{\gamma}_1$ cannot be distinguished from $c_i$.
Note that $\theta_1$ is the intercept for the base time period, $t = 1$, and $\boldsymbol{\gamma}_1$
measures the effects of $\mathbf{z}_i$ on $y_{it}$ in period $t = 1$. Even though we cannot identify the
effects of the $\mathbf{z}_i$ in any particular time period,
$\boldsymbol{\gamma}_2, \boldsymbol{\gamma}_3, \ldots, \boldsymbol{\gamma}_T$ are identified, and
therefore we can estimate the differences in the partial effects of time-constant variables relative to
a base period. In particular, we can test whether the effects of time-constant variables have changed
over time. As a specific example, if $y_{it} = \log(wage_{it})$ and one element of $\mathbf{z}_i$ is a
female binary variable, then we can estimate how the gender gap has changed over time, even though we
cannot estimate the gap in any particular time period.

The idea for estimating $\boldsymbol{\beta}$ under Assumption FE.1 is to transform the equations to
eliminate the unobserved effect $c_i$. When at least two time periods are available, there are several
transformations that accomplish this purpose. In this section we study the fixed effects
transformation, also called the within transformation. The FE transformation is obtained by first
averaging equation (10.41) over $t = 1, \ldots, T$ to get the cross section equation

$$\bar{y}_i = \bar{\mathbf{x}}_i\boldsymbol{\beta} + c_i + \bar{u}_i, \tag{10.45}$$

where $\bar{y}_i = T^{-1}\sum_{t=1}^{T} y_{it}$, $\bar{\mathbf{x}}_i = T^{-1}\sum_{t=1}^{T}\mathbf{x}_{it}$,
and $\bar{u}_i = T^{-1}\sum_{t=1}^{T} u_{it}$. Subtracting equation (10.45) from equation (10.41) for
each $t$ gives the FE transformed equation,

$$y_{it} - \bar{y}_i = (\mathbf{x}_{it} - \bar{\mathbf{x}}_i)\boldsymbol{\beta} + u_{it} - \bar{u}_i$$

or

$$\ddot{y}_{it} = \ddot{\mathbf{x}}_{it}\boldsymbol{\beta} + \ddot{u}_{it}, \qquad t = 1, 2, \ldots, T, \tag{10.46}$$

where $\ddot{y}_{it} \equiv y_{it} - \bar{y}_i$, $\ddot{\mathbf{x}}_{it} \equiv \mathbf{x}_{it} - \bar{\mathbf{x}}_i$,
and $\ddot{u}_{it} \equiv u_{it} - \bar{u}_i$. The time demeaning of the original equation has removed
the individual specific effect $c_i$.

With $c_i$ out of the picture, it is natural to think of estimating equation (10.46) by pooled OLS.
Before investigating this possibility, we must remember that equation (10.46) is an estimating
equation: the interpretation of $\boldsymbol{\beta}$ comes from the (structural) conditional expectation
$E(y_{it} \mid \mathbf{x}_i, c_i) = E(y_{it} \mid \mathbf{x}_{it}, c_i) = \mathbf{x}_{it}\boldsymbol{\beta} + c_i$.

To see whether pooled OLS estimation of equation (10.46) will be consistent, we need to show that the
key pooled OLS assumption (Assumption POLS.1 from Chapter 7) holds in equation (10.46). That is,

$$E(\ddot{\mathbf{x}}_{it}'\ddot{u}_{it}) = \mathbf{0}, \qquad t = 1, 2, \ldots, T. \tag{10.47}$$

For each $t$, the left-hand side of equation (10.47) can be written as
$E[(\mathbf{x}_{it} - \bar{\mathbf{x}}_i)'(u_{it} - \bar{u}_i)]$. Now, under Assumption FE.1, $u_{it}$ is
uncorrelated with $\mathbf{x}_{is}$, for all $s, t = 1, 2, \ldots, T$. It follows that $u_{it}$ and
$\bar{u}_i$ are uncorrelated with $\mathbf{x}_{it}$ and $\bar{\mathbf{x}}_i$ for $t = 1, 2, \ldots, T$.
Therefore, assumption (10.47) holds under Assumption FE.1, and so pooled OLS applied to equation (10.46)
can be expected to produce consistent estimators. We can actually say a lot more than condition
(10.47): under Assumption FE.1,
$E(\ddot{u}_{it} \mid \mathbf{x}_i) = E(u_{it} \mid \mathbf{x}_i) - E(\bar{u}_i \mid \mathbf{x}_i) = 0$,
which in turn implies that $E(\ddot{u}_{it} \mid \ddot{\mathbf{x}}_{i1}, \ldots, \ddot{\mathbf{x}}_{iT}) = 0$,
since each $\ddot{\mathbf{x}}_{it}$ is just a function of
$\mathbf{x}_i = (\mathbf{x}_{i1}, \ldots, \mathbf{x}_{iT})$. This result shows that the
$\ddot{\mathbf{x}}_{it}$ satisfy the conditional expectation form of the strict exogeneity assumption in
the model (10.46). Among other things, this conclusion implies that the FE estimator of
$\boldsymbol{\beta}$ that we will derive is actually unbiased under Assumption FE.1.

It is important to see that assumption (10.47) fails if we try to relax the strict exogeneity
assumption to something weaker, such as $E(\mathbf{x}_{it}'u_{it}) = \mathbf{0}$, all $t$, because this
assumption does not ensure that $\mathbf{x}_{is}$ is uncorrelated with $u_{it}$, $s \neq t$. In Chapter
11 we study violations of the strict exogeneity assumption in more detail.

The FE estimator, denoted by $\hat{\boldsymbol{\beta}}_{FE}$, is the pooled OLS estimator from the
regression

$$\ddot{y}_{it} \text{ on } \ddot{\mathbf{x}}_{it}, \qquad t = 1, 2, \ldots, T;\; i = 1, 2, \ldots, N. \tag{10.48}$$

The FE estimator is simple to compute once the time demeaning has been carried out. Some econometrics
packages have special commands to carry out FE estimation (and commands to carry out the time
demeaning for all $i$). It is also fairly easy to program this estimator in matrix-oriented languages.
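To make regression (10.48) concrete, here is a minimal Python sketch, assuming NumPy is available. The function names `within_transform` and `fe_estimator`, the simulated panel, and the true slope of 2.0 are illustrative assumptions, not part of the text. The regressor is built to be correlated with $c_i$, so pooled OLS on the levels should be biased while FE should not.

```python
import numpy as np

def within_transform(a, ids):
    """Subtract each unit's time average from its rows (time demeaning)."""
    a = np.asarray(a, dtype=float)
    out = a.copy()
    for g in np.unique(ids):
        mask = ids == g
        out[mask] -= a[mask].mean(axis=0)
    return out

def fe_estimator(y, X, ids):
    """Pooled OLS of time-demeaned y on time-demeaned X, as in regression (10.48)."""
    y_dd = within_transform(y, ids)
    X_dd = within_transform(X, ids)
    beta, *_ = np.linalg.lstsq(X_dd, y_dd, rcond=None)
    return beta

# Hypothetical balanced panel: the regressor depends on the unobserved effect c_i.
rng = np.random.default_rng(0)
N, T = 500, 4
ids = np.repeat(np.arange(N), T)
c = rng.normal(size=N)
x = c[ids] + rng.normal(size=N * T)
y = 2.0 * x + c[ids] + rng.normal(size=N * T)

beta_fe = fe_estimator(y, x.reshape(-1, 1), ids)

# Pooled OLS on levels (with an intercept) ignores c_i and picks up its correlation with x.
Z = np.column_stack([np.ones(N * T), x])
beta_pols = np.linalg.lstsq(Z, y, rcond=None)[0][1]
```

In this design the pooled OLS slope should drift toward 2.5 (the true slope plus Cov(x, c)/Var(x)), while `beta_fe` should stay close to 2.0. The standard errors printed by a generic OLS routine on the demeaned data would still need the degrees-of-freedom adjustment discussed in Section 10.5.2.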

To study the FE estimator a little more closely, write equation (10.46) for all time periods as

$$\ddot{\mathbf{y}}_i = \ddot{\mathbf{X}}_i\boldsymbol{\beta} + \ddot{\mathbf{u}}_i, \tag{10.49}$$

where $\ddot{\mathbf{y}}_i$ is $T \times 1$, $\ddot{\mathbf{X}}_i$ is $T \times K$, and
$\ddot{\mathbf{u}}_i$ is $T \times 1$. This set of equations can be obtained by premultiplying equation
(10.42) by a time-demeaning matrix. Define
$\mathbf{Q}_T \equiv \mathbf{I}_T - \mathbf{j}_T(\mathbf{j}_T'\mathbf{j}_T)^{-1}\mathbf{j}_T'$, which is
easily seen to be a $T \times T$ symmetric, idempotent matrix with rank $T - 1$. Further,
$\mathbf{Q}_T\mathbf{j}_T = \mathbf{0}$, $\mathbf{Q}_T\mathbf{y}_i = \ddot{\mathbf{y}}_i$,
$\mathbf{Q}_T\mathbf{X}_i = \ddot{\mathbf{X}}_i$, and $\mathbf{Q}_T\mathbf{u}_i = \ddot{\mathbf{u}}_i$, and
so premultiplying equation (10.42) by $\mathbf{Q}_T$ gives the demeaned equations (10.49).

In order to ensure that the FE estimator is well behaved asymptotically, we need a standard rank
condition on the matrix of time-demeaned explanatory variables:


ASSUMPTION FE.2: $\operatorname{rank}\left(\sum_{t=1}^{T} E(\ddot{\mathbf{x}}_{it}'\ddot{\mathbf{x}}_{it})\right) = \operatorname{rank}[E(\ddot{\mathbf{X}}_i'\ddot{\mathbf{X}}_i)] = K$.


If $\mathbf{x}_{it}$ contains an element that does not vary over time for any $i$, then the
corresponding element in $\ddot{\mathbf{x}}_{it}$ is identically zero for all $t$ and any draw from the
cross section. Since $\ddot{\mathbf{X}}_i$ would contain a column of zeros for all $i$, Assumption FE.2
could not be true. Assumption FE.2 shows explicitly why time-constant variables are not allowed in
fixed effects analysis (unless they are interacted with time-varying variables, such as time dummies).
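The stated properties of the time-demeaning matrix $\mathbf{Q}_T$ are easy to confirm numerically. The sketch below assumes NumPy; the choice $T = 4$ and the vector `y_i` are arbitrary illustrations rather than values from the text.

```python
import numpy as np

T = 4
j_T = np.ones((T, 1))                                   # T x 1 vector of ones
Q_T = np.eye(T) - j_T @ np.linalg.inv(j_T.T @ j_T) @ j_T.T

y_i = np.array([3.0, 5.0, 4.0, 8.0])                    # arbitrary illustrative outcomes
y_dd = Q_T @ y_i                                        # equals y_i minus its time average

is_symmetric = np.allclose(Q_T, Q_T.T)
is_idempotent = np.allclose(Q_T @ Q_T, Q_T)
rank_Q = np.linalg.matrix_rank(Q_T)                     # T - 1, so Q_T is singular
kills_constants = np.allclose(Q_T @ j_T, 0.0)           # Q_T j_T = 0
```

The last check is the matrix form of the point made under Assumption FE.2: any column of $\mathbf{X}_i$ that is constant over time is a multiple of $\mathbf{j}_T$ and is mapped to a column of zeros.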

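The FE estimator can be expressed as

$$\hat{\boldsymbol{\beta}}_{FE} = \left(\sum_{i=1}^{N}\ddot{\mathbf{X}}_i'\ddot{\mathbf{X}}_i\right)^{-1}\left(\sum_{i=1}^{N}\ddot{\mathbf{X}}_i'\ddot{\mathbf{y}}_i\right) = \left(\sum_{i=1}^{N}\sum_{t=1}^{T}\ddot{\mathbf{x}}_{it}'\ddot{\mathbf{x}}_{it}\right)^{-1}\left(\sum_{i=1}^{N}\sum_{t=1}^{T}\ddot{\mathbf{x}}_{it}'\ddot{y}_{it}\right). \tag{10.50}$$

It is also called the within estimator because it uses the time variation within each cross section.
The between estimator, which uses only variation between the cross section observations, is the OLS
estimator applied to the time-averaged equation (10.45). This estimator is not consistent under
Assumption FE.1 because $E(\bar{\mathbf{x}}_i'c_i)$ is not necessarily zero. The between estimator is
consistent under Assumption RE.1 and a standard rank condition, but it effectively discards the time
series information in the data set. It is more efficient to use the RE estimator.

Under Assumption FE.1 and the finite sample version of Assumption FE.2, namely,
$\operatorname{rank}(\ddot{\mathbf{X}}'\ddot{\mathbf{X}}) = K$, $\hat{\boldsymbol{\beta}}_{FE}$ can be
shown to be unbiased conditional on $\mathbf{X}$.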


10.5.2 Asymptotic Inference with Fixed Effects 


Without further assumptions the FE estimator is not necessarily the most efficient estimator based on
Assumption FE.1. The next assumption ensures that FE is efficient.


ASSUMPTION FE.3: $E(\mathbf{u}_i\mathbf{u}_i' \mid \mathbf{x}_i, c_i) = \sigma_u^2\mathbf{I}_T$.


Assumption FE.3 is identical to Assumption RE.3a. Since $E(\mathbf{u}_i \mid \mathbf{x}_i, c_i) = \mathbf{0}$
by Assumption FE.1, Assumption FE.3 is the same as saying
$\operatorname{Var}(\mathbf{u}_i \mid \mathbf{x}_i, c_i) = \sigma_u^2\mathbf{I}_T$ if Assumption FE.1 also
holds. As with Assumption RE.3a, it is useful to think of Assumption FE.3 as having two parts. The
first is that $E(\mathbf{u}_i\mathbf{u}_i' \mid \mathbf{x}_i, c_i) = E(\mathbf{u}_i\mathbf{u}_i')$, which is
standard in system estimation contexts (see equation (7.53)). The second is that the unconditional
variance matrix $E(\mathbf{u}_i\mathbf{u}_i')$ has the special form $\sigma_u^2\mathbf{I}_T$. This implies
that the idiosyncratic errors $u_{it}$ have a constant variance across $t$ and are serially
uncorrelated, just as in assumptions (10.28) and (10.29).

Assumption FE.3, along with Assumption FE.1, implies that the unconditional variance matrix of the
composite error $\mathbf{v}_i = c_i\mathbf{j}_T + \mathbf{u}_i$ has the RE form. However, without
Assumption RE.3b, $E(\mathbf{v}_i\mathbf{v}_i' \mid \mathbf{x}_i) \neq E(\mathbf{v}_i\mathbf{v}_i')$. While
this result matters for inference with the RE estimator, it has no bearing on a fixed effects analysis.

It is not obvious that Assumption FE.3 has the desired consequences of ensuring efficiency of FE and
leading to simple computation of standard errors and test statistics. Consider the demeaned equation
(10.46). Normally, for pooled OLS to be relatively efficient, we require that the
$\{\ddot{u}_{it}: t = 1, 2, \ldots, T\}$ be homoskedastic across $t$ and serially uncorrelated. The
variance of $\ddot{u}_{it}$ can be computed as

$$E(\ddot{u}_{it}^2) = E[(u_{it} - \bar{u}_i)^2] = E(u_{it}^2) + E(\bar{u}_i^2) - 2E(u_{it}\bar{u}_i) = \sigma_u^2 + \sigma_u^2/T - 2\sigma_u^2/T = \sigma_u^2(1 - 1/T), \tag{10.51}$$

which verifies (unconditional) homoskedasticity across $t$. However, for $t \neq s$, the covariance
between $\ddot{u}_{it}$ and $\ddot{u}_{is}$ is

$$E(\ddot{u}_{it}\ddot{u}_{is}) = E[(u_{it} - \bar{u}_i)(u_{is} - \bar{u}_i)] = E(u_{it}u_{is}) - E(u_{it}\bar{u}_i) - E(u_{is}\bar{u}_i) + E(\bar{u}_i^2) = 0 - \sigma_u^2/T - \sigma_u^2/T + \sigma_u^2/T = -\sigma_u^2/T < 0.$$

Combining this expression with the variance in equation (10.51) gives, for all $t \neq s$,

$$\operatorname{Corr}(\ddot{u}_{it}, \ddot{u}_{is}) = -1/(T - 1), \tag{10.52}$$

which shows that the time-demeaned errors $\ddot{u}_{it}$ are negatively serially correlated. (As $T$
gets large, the correlation tends to zero.)
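Equations (10.51) and (10.52) can be checked directly from the matrix form of time demeaning: under Assumption FE.3 the variance matrix of the demeaned error vector is $\sigma_u^2\mathbf{Q}_T$. The Python sketch below (NumPy assumed; the function name `demeaned_error_corr` is a hypothetical helper) converts that matrix into a correlation matrix.

```python
import numpy as np

def demeaned_error_corr(T, sigma2_u=1.0):
    """Correlation matrix of the time-demeaned errors when Var(u_i) = sigma2_u * I_T."""
    Q = np.eye(T) - np.ones((T, T)) / T       # time-demeaning matrix Q_T
    V = sigma2_u * Q @ Q.T                    # Var(Q_T u_i) = Q_T (sigma2_u I_T) Q_T'
    d = np.sqrt(np.diag(V))                   # each diagonal entry is sigma2_u * (1 - 1/T)
    return V / np.outer(d, d)

corr_T2 = demeaned_error_corr(2)              # off-diagonal entries: -1/(2 - 1)
corr_T5 = demeaned_error_corr(5)              # off-diagonal entries: -1/(5 - 1)
```

The off-diagonal entries do not depend on $\sigma_u^2$, which is why the null value $-1/(T - 1)$ can be used later when testing for serial correlation in the idiosyncratic errors.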

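It turns out that, because of the nature of time demeaning, the serial correlation in the
$\ddot{u}_{it}$ under Assumption FE.3 causes only minor complications. To find the asymptotic variance
of $\hat{\boldsymbol{\beta}}_{FE}$, write

$$\sqrt{N}(\hat{\boldsymbol{\beta}}_{FE} - \boldsymbol{\beta}) = \left(N^{-1}\sum_{i=1}^{N}\ddot{\mathbf{X}}_i'\ddot{\mathbf{X}}_i\right)^{-1}\left(N^{-1/2}\sum_{i=1}^{N}\ddot{\mathbf{X}}_i'\mathbf{u}_i\right),$$

where we have used the important fact that
$\ddot{\mathbf{X}}_i'\ddot{\mathbf{u}}_i = \mathbf{X}_i'\mathbf{Q}_T\mathbf{u}_i = \ddot{\mathbf{X}}_i'\mathbf{u}_i$.
Under Assumption FE.3, $E(\mathbf{u}_i\mathbf{u}_i' \mid \ddot{\mathbf{X}}_i) = \sigma_u^2\mathbf{I}_T$.
From the system OLS analysis in Chapter 7 it follows that

$$\sqrt{N}(\hat{\boldsymbol{\beta}}_{FE} - \boldsymbol{\beta}) \overset{a}{\sim} \text{Normal}\left(\mathbf{0}, \sigma_u^2[E(\ddot{\mathbf{X}}_i'\ddot{\mathbf{X}}_i)]^{-1}\right),$$

and so

$$\operatorname{Avar}(\hat{\boldsymbol{\beta}}_{FE}) = \sigma_u^2[E(\ddot{\mathbf{X}}_i'\ddot{\mathbf{X}}_i)]^{-1}/N. \tag{10.53}$$

Given a consistent estimator $\hat{\sigma}_u^2$ of $\sigma_u^2$, equation (10.53) is easily estimated by
also replacing $E(\ddot{\mathbf{X}}_i'\ddot{\mathbf{X}}_i)$ with its sample analogue
$N^{-1}\sum_{i=1}^{N}\ddot{\mathbf{X}}_i'\ddot{\mathbf{X}}_i$:

$$\widehat{\operatorname{Avar}}(\hat{\boldsymbol{\beta}}_{FE}) = \hat{\sigma}_u^2\left(\sum_{i=1}^{N}\ddot{\mathbf{X}}_i'\ddot{\mathbf{X}}_i\right)^{-1} = \hat{\sigma}_u^2\left(\sum_{i=1}^{N}\sum_{t=1}^{T}\ddot{\mathbf{x}}_{it}'\ddot{\mathbf{x}}_{it}\right)^{-1}. \tag{10.54}$$

The asymptotic standard errors of the FE estimates are obtained as the square roots of the diagonal
elements of the matrix (10.54).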

Expression (10.54) is very convenient because it looks just like the usual OLS variance matrix
estimator that would be reported from the pooled OLS regression (10.48). However, there is one catch,
and this comes in obtaining the estimator $\hat{\sigma}_u^2$ of $\sigma_u^2$. The errors in the
transformed model are $\ddot{u}_{it}$, and these errors are what the OLS residuals from regression
(10.48) estimate. Since $\sigma_u^2$ is the variance of $u_{it}$, we must use a little care.

To see how to estimate $\sigma_u^2$, we use equation (10.51) summed across $t$:
$\sum_{t=1}^{T}E(\ddot{u}_{it}^2) = (T - 1)\sigma_u^2$, and so
$[N(T - 1)]^{-1}\sum_{i=1}^{N}\sum_{t=1}^{T}E(\ddot{u}_{it}^2) = \sigma_u^2$. Now, define the fixed
effects residuals as

$$\hat{\ddot{u}}_{it} = \ddot{y}_{it} - \ddot{\mathbf{x}}_{it}\hat{\boldsymbol{\beta}}_{FE}, \qquad t = 1, 2, \ldots, T;\; i = 1, 2, \ldots, N, \tag{10.55}$$

which are simply the OLS residuals from the pooled regression (10.48). Then a consistent estimator of
$\sigma_u^2$ under Assumptions FE.1 to FE.3 is

$$\hat{\sigma}_u^2 = \text{SSR}/[N(T - 1) - K], \tag{10.56}$$

where $\text{SSR} = \sum_{i=1}^{N}\sum_{t=1}^{T}\hat{\ddot{u}}_{it}^2$. The subtraction of $K$ in the
denominator of equation (10.56) does not matter asymptotically, but it is standard to make such a
correction. In fact, under Assumptions FE.1 to FE.3, it can be shown that $\hat{\sigma}_u^2$ is actually
an unbiased estimator of $\sigma_u^2$ conditional on $\mathbf{X}$ (and therefore unconditionally as
well).

Pay careful attention to the denominator in equation (10.56). This is not the degrees of freedom that
would be obtained from regression (10.48). In fact, the usual variance estimate from regression (10.48)
would be $\text{SSR}/(NT - K)$, which has a probability limit less than $\sigma_u^2$ as $N$ gets large.
The difference between $\text{SSR}/(NT - K)$ and equation (10.56) can be substantial when $T$ is small.

The upshot of all this is that the usual standard errors reported from the regression (10.48) will be
too small on average because they use the incorrect estimate of $\sigma_u^2$. Of course, computing
equation (10.56) directly is pretty trivial. But, if a standard regression package is used after time
demeaning, it is perhaps easiest to adjust the usual standard errors directly. Since $\hat{\sigma}_u$
appears in the standard errors, each standard error is simply multiplied by the factor
$\{(NT - K)/[N(T - 1) - K]\}^{1/2}$. As an example, if $N = 500$, $T = 3$, and $K = 10$, the correction
factor is about 1.227.
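The correction factor is simple enough to compute by hand, but a short Python sketch makes the arithmetic explicit. The function name `fe_se_correction` is a hypothetical helper introduced only for illustration.

```python
import math

def fe_se_correction(N, T, K):
    """Factor that rescales pooled-OLS standard errors from regression (10.48)
    so that they use N(T - 1) - K rather than NT - K degrees of freedom."""
    return math.sqrt((N * T - K) / (N * (T - 1) - K))

# The example in the text: N = 500, T = 3, K = 10.
factor = fe_se_correction(500, 3, 10)     # sqrt(1490 / 990)

# With T = 2 the adjustment is larger, since half of the NT observations are "used up".
factor_T2 = fe_se_correction(500, 2, 10)  # sqrt(990 / 490)
```

Multiplying each reported standard error by `factor` gives the standard errors implied by equations (10.54) and (10.56).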




If an econometrics package has an option for explicitly obtaining fixed effects estimates using panel
data, $\sigma_u^2$ will be properly estimated, and you do not have to worry about adjusting the
standard errors. Many software packages also compute an estimate of $\sigma_c^2$, which is useful to
determine how large the variance of the unobserved component is relative to the variance of the
idiosyncratic component. Given $\hat{\boldsymbol{\beta}}_{FE}$,
$\hat{\sigma}_v^2 = (NT - K)^{-1}\sum_{i=1}^{N}\sum_{t=1}^{T}(y_{it} - \mathbf{x}_{it}\hat{\boldsymbol{\beta}}_{FE})^2$
is a consistent estimator of $\sigma_v^2 = \sigma_c^2 + \sigma_u^2$, and so a consistent estimator of
$\sigma_c^2$ is $\hat{\sigma}_v^2 - \hat{\sigma}_u^2$. (See Problem 10.14 for a discussion of why the
estimated variance of the unobserved effect in an FE analysis is generally larger than that for an RE
analysis.)


Example 10.5 (FE Estimation of the Effects of Job Training Grants): Using the data in JTRAIN1.RAW, we
estimate the effect of job training grants using the FE estimator. The variable union has been dropped
because it does not vary over time for any of the firms in the sample. The estimated equation with
standard errors is

$$\widehat{\log(scrap)} = \underset{(.109)}{-.080}\,d88 \;\underset{(.133)}{-\,.247}\,d89 \;\underset{(.151)}{-\,.252}\,grant \;\underset{(.210)}{-\,.422}\,grant_{-1}.$$

Compared with the RE estimates, the grant is estimated to have a larger effect, both contemporaneously
and lagged one year. The $t$ statistics are also somewhat more significant with FE.


Under Assumptions FE.1 to FE.3, multiple restrictions are most easily tested using an $F$ statistic,
provided the degrees of freedom are appropriately computed. Let $\text{SSR}_{ur}$ be the unrestricted
SSR from regression (10.48), and let $\text{SSR}_r$ denote the restricted sum of squared residuals from
a similar regression, but with $Q$ restrictions imposed on $\boldsymbol{\beta}$. Then

$$F = \frac{(\text{SSR}_r - \text{SSR}_{ur})}{\text{SSR}_{ur}} \cdot \frac{[N(T - 1) - K]}{Q}$$

is approximately $F$ distributed with $Q$ and $N(T - 1) - K$ degrees of freedom. (The precise statement
is that $Q \cdot F \overset{a}{\sim} \chi_Q^2$ as $N \to \infty$ under $H_0$.) When this equation is
applied to Example 10.5, the $F$ statistic for joint significance of $grant$ and $grant_{-1}$ is
$F = 2.23$, with $p$-value $= .113$.
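The only nonstandard part of this $F$ statistic is the denominator degrees of freedom. A minimal Python sketch follows; the function name `fe_f_statistic` and the SSR values in the usage line are made-up illustrations and are not the numbers behind Example 10.5.

```python
def fe_f_statistic(ssr_r, ssr_ur, N, T, K, Q):
    """F statistic for Q restrictions on beta after fixed effects estimation.

    ssr_r and ssr_ur are the restricted and unrestricted sums of squared
    residuals from the time-demeaned regressions; K counts the regressors
    in the unrestricted model.
    """
    df_denom = N * (T - 1) - K            # not NT - K: N degrees of freedom go to demeaning
    F = (ssr_r - ssr_ur) / ssr_ur * df_denom / Q
    return F, df_denom

# Hypothetical inputs chosen only to show the arithmetic.
F_stat, df_denom = fe_f_statistic(ssr_r=110.0, ssr_ur=100.0, N=50, T=3, K=4, Q=2)
```

A $p$-value would then come from the $F$ distribution with $Q$ and `df_denom` degrees of freedom, or from the $\chi_Q^2$ distribution applied to $Q \cdot F$.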


10.5.3 Dummy Variable Regression 


So far we have viewed the $c_i$ as being unobservable random variables, and for most applications this
approach gives the appropriate interpretation of $\boldsymbol{\beta}$. Traditional approaches to fixed
effects estimation view the $c_i$ as parameters to be estimated along with $\boldsymbol{\beta}$. In
fact, if Assumption FE.2 is changed to its finite sample version,
$\operatorname{rank}(\ddot{\mathbf{X}}'\ddot{\mathbf{X}}) = K$, then the model under Assumptions FE.1 to
FE.3 satisfies the Gauss-Markov assumptions conditional on $\mathbf{X}$.

If the $c_i$ are parameters to estimate, how would we estimate each $c_i$ along with
$\boldsymbol{\beta}$? One possibility is to define $N$ dummy variables, one for each cross section
observation: $dn_i = 1$ if $n = i$, $dn_i = 0$ if $n \neq i$. Then, run the pooled OLS regression

$$y_{it} \text{ on } d1_i, d2_i, \ldots, dN_i, \mathbf{x}_{it}, \qquad t = 1, 2, \ldots, T;\; i = 1, 2, \ldots, N. \tag{10.57}$$

Then, $\hat{c}_1$ is the coefficient on $d1_i$, $\hat{c}_2$ is the coefficient on $d2_i$, and so on.

It is a nice exercise in least squares mechanics, in particular partitioned regression (see Davidson
and MacKinnon, 1993, Sect. 1.4), to show that the estimator of $\boldsymbol{\beta}$ obtained from
regression (10.57) is, in fact, the FE estimator. This is why $\hat{\boldsymbol{\beta}}_{FE}$ is
sometimes referred to as the dummy variable estimator. Also, the residuals from regression (10.57) are
identical to the residuals from regression (10.48). One benefit of regression (10.57) is that it
produces the appropriate estimate of $\sigma_u^2$ because it uses $NT - N - K = N(T - 1) - K$ as the
degrees of freedom. Therefore, if it can be done, regression (10.57) is a convenient way to carry out
FE analysis under Assumptions FE.1 to FE.3.
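The equivalence between the dummy variable regression (10.57) and the within regression (10.48) can be verified numerically. The sketch below assumes NumPy and a small simulated balanced panel sorted by unit and then time; all data-generating values are arbitrary assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
N, T, K = 30, 3, 2
ids = np.repeat(np.arange(N), T)
c = rng.normal(size=N)
X = rng.normal(size=(N * T, K)) + c[ids][:, None]       # regressors correlated with c_i
y = X @ np.array([1.0, -0.5]) + c[ids] + rng.normal(size=N * T)

# Dummy variable regression (10.57): N unit dummies plus x_it, no overall intercept.
D = (ids[:, None] == np.arange(N)[None, :]).astype(float)
W = np.column_stack([D, X])
coef_dv = np.linalg.lstsq(W, y, rcond=None)[0]
c_hat, beta_dv = coef_dv[:N], coef_dv[N:]
resid_dv = y - W @ coef_dv

# Within regression (10.48) on time-demeaned data.
def demean(a):
    a3 = a.reshape(N, T, -1)
    return (a3 - a3.mean(axis=1, keepdims=True)).reshape(N * T, -1)

X_dd, y_dd = demean(X), demean(y).ravel()
beta_fe = np.linalg.lstsq(X_dd, y_dd, rcond=None)[0]
resid_fe = y_dd - X_dd @ beta_fe

# Both routes give the same SSR, so the same estimate of sigma_u^2 once the
# degrees of freedom N(T - 1) - K are used.
sigma2_u_hat = resid_fe @ resid_fe / (N * (T - 1) - K)
```

With $N$ in the thousands the dummy matrix `D` becomes large, which is the practical reason most software relies on time demeaning instead of literally running regression (10.57).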

There is an important difference between the $\hat{c}_i$ and $\hat{\boldsymbol{\beta}}_{FE}$. We already
know that $\hat{\boldsymbol{\beta}}_{FE}$ is consistent with fixed $T$ as $N \to \infty$. This is not the
case with the $\hat{c}_i$. Each time a new cross section observation is added, another $c_i$ is added,
and information does not accumulate on the $c_i$ as $N \to \infty$. Each $\hat{c}_i$ is an unbiased
estimator of $c_i$ when the $c_i$ are treated as parameters, at least if we maintain Assumption FE.1
and the finite sample analogue of Assumption FE.2. When we add Assumption FE.3, the Gauss-Markov
assumptions hold (conditional on $\mathbf{X}$), and $\hat{c}_1, \hat{c}_2, \ldots, \hat{c}_N$ are best
linear unbiased conditional on $\mathbf{X}$. (The $\hat{c}_i$ give practical examples of estimators that
are unbiased but not consistent.)

Econometric software that computes fixed effects estimates rarely reports the "estimates" of the $c_i$
(at least in part because there are typically very many). Sometimes it is useful to obtain the
$\hat{c}_i$ even when regression (10.57) is infeasible. Using the OLS first-order conditions for the
dummy variable regression, it can be shown that

$$\hat{c}_i = \bar{y}_i - \bar{\mathbf{x}}_i\hat{\boldsymbol{\beta}}_{FE}, \qquad i = 1, 2, \ldots, N, \tag{10.58}$$

which makes sense because $\hat{c}_i$ is the intercept for cross section unit $i$. We can then focus on
specific units, which is more fruitful with larger $T$, or we can compute the sample average, sample
median, or sample quantiles of the $\hat{c}_i$ to get some idea of how heterogeneity is distributed in
the population. For example, the sample average

$$\hat{\mu}_c \equiv N^{-1}\sum_{i=1}^{N}\hat{c}_i = N^{-1}\sum_{i=1}^{N}(\bar{y}_i - \bar{\mathbf{x}}_i\hat{\boldsymbol{\beta}}_{FE})$$

is easily seen to be a consistent estimator (as $N \to \infty$) of the population average
$\mu_c \equiv E(c_i) = E(\bar{y}_i) - E(\bar{\mathbf{x}}_i)\boldsymbol{\beta}$. So, although we cannot
estimate each $c_i$ very well with small $T$, we can often estimate features of the population
distribution of $c_i$ quite well. In fact, many econometrics software packages report $\hat{\mu}_c$ as
the "intercept" along with $\hat{\boldsymbol{\beta}}_{FE}$ in fixed effects estimation. In other words,
the intercept is simply an estimate of the average heterogeneity. (Seeing an "intercept" with fixed
effects output can be a bit confusing because we have seen that the within transformation eliminates
any time-constant explanatory variables. But having done that, we can then use the fixed effects
estimates of $\boldsymbol{\beta}$ to estimate the mean of the heterogeneity distribution, and this is
what is reported as the "intercept.")
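A short simulation illustrates the contrast between the individual $\hat{c}_i$ and their average. The Python sketch below assumes NumPy and a single regressor; the population mean of 1.5 for $c_i$, the slope of 2.0, and the variable names are assumptions made only for this illustration.

```python
import numpy as np

rng = np.random.default_rng(2)
N, T = 2000, 3
c = 1.5 + rng.normal(size=N)                       # heterogeneity with population mean 1.5
x = c[:, None] + rng.normal(size=(N, T))           # regressor correlated with c_i
y = 2.0 * x + c[:, None] + rng.normal(size=(N, T))

x_bar, y_bar = x.mean(axis=1), y.mean(axis=1)
x_dd, y_dd = x - x_bar[:, None], y - y_bar[:, None]
beta_fe = (x_dd * y_dd).sum() / (x_dd ** 2).sum()  # FE estimate with one regressor

c_hat = y_bar - x_bar * beta_fe                    # equation (10.58)
mu_c_hat = c_hat.mean()                            # the "intercept" many packages report

# Estimation error in each c_hat_i is roughly the time average of u_it, whose
# variance is sigma_u^2 / T; it does not shrink as N grows with T fixed.
noise_var = np.var(c_hat - c)
```

Here `mu_c_hat` should settle near 1.5 as $N$ grows, while `noise_var` should remain near $\sigma_u^2/T = 1/3$ however large $N$ is.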

As with RE estimation, we can consistently estimate the variance $\sigma_c^2$ if we add the assumption
that $\{u_{it}: t = 1, 2, \ldots\}$ is serially uncorrelated with constant variance, as implied by
Assumption FE.3. First, we can consistently estimate the variance of the composite error,
$\sigma_v^2 = \sigma_c^2 + \sigma_u^2$, because $v_{it} = y_{it} - \mathbf{x}_{it}\boldsymbol{\beta}$;
therefore,

$$\hat{\sigma}_v^2 = (NT - K)^{-1}\sum_{i=1}^{N}\sum_{t=1}^{T}(y_{it} - \mathbf{x}_{it}\hat{\boldsymbol{\beta}}_{FE})^2$$

is consistent for $\sigma_v^2$ as $N \to \infty$ (where the subtraction of $K$ is a standard
degrees-of-freedom adjustment). Then, we can estimate $\sigma_c^2$ as
$\hat{\sigma}_c^2 = \hat{\sigma}_v^2 - \hat{\sigma}_u^2$, where $\hat{\sigma}_u^2$ is given by (10.56).
We can use $\hat{\sigma}_c^2$ (assuming it is positive) to assess the variability in heterogeneity
about its mean value. (Incidentally, as shown in Problem 10.14, one must be careful in using FE to
estimate $\sigma_c^2$ before applying RE because, with time-constant variables, the FE estimate of the
heterogeneity variance will be too large. Pooled OLS should be used instead, as we discussed in Section
10.4.1.)

When the $c_i$ are treated as different intercepts, we can compute an exact test of their equality
under the classical linear model assumptions (which require, in addition to Assumptions FE.1 to FE.3,
normality of the $u_{it}$). The $F$ statistic has an $F$ distribution with numerator and denominator
degrees of freedom $N - 1$ and $N(T - 1) - K$, respectively. Interestingly, Orme and Yamagata (2005)
show that the $F$ statistic is still justified without normality when $T$ is fixed and $N \to \infty$,
although they assume the idiosyncratic errors are homoskedastic and serially uncorrelated.

Generally, we should view the fact that the dummy variable regression (10.57) produces
$\hat{\boldsymbol{\beta}}_{FE}$ as the coefficient vector on $\mathbf{x}_{it}$ as a coincidence. While
there are other unobserved effects models where "estimating" the unobserved effects along with the
vector $\boldsymbol{\beta}$ results in a consistent estimator of $\boldsymbol{\beta}$, there are many
cases where this approach leads to trouble. As we will see in Part IV, many nonlinear panel data models
with unobserved effects suffer from an incidental parameters problem, where estimating the incidental
parameters, $c_i$, along with $\boldsymbol{\beta}$ produces an inconsistent estimator of
$\boldsymbol{\beta}$.


10.5.4 Serial Correlation and the Robust Variance Matrix Estimator 


Recall that the FE estimator is consistent and asymptotically normal under Assumptions FE.1 and FE.2.
But without Assumption FE.3, expression (10.54) gives an improper variance matrix estimator. While
heteroskedasticity in $u_{it}$ is always a potential problem, serial correlation is likely to be more
important in certain applications. When applying the FE estimator, it is important to remember that
nothing rules out serial correlation in $\{u_{it}: t = 1, \ldots, T\}$. While it is true that the
observed serial correlation in the composite errors, $v_{it} = c_i + u_{it}$, is dominated by the
presence of $c_i$, there can also be serial correlation that dies out over time. Sometimes,
$\{u_{it}\}$ can have very strong serial dependence, in which case the usual FE standard errors
obtained from expression (10.54) can be very misleading. This possibility tends to be a bigger problem
with large $T$. (As we will see, there is no reason to worry about serial correlation in $u_{it}$ when
$T = 2$.)

Testing the idiosyncratic errors, $\{u_{it}\}$, for serial correlation is somewhat tricky. A key point
is that we cannot estimate the $u_{it}$; because of the time demeaning used in FE, we can only estimate
the time-demeaned errors, $\ddot{u}_{it}$. As shown in equation (10.52), the time-demeaned errors are
negatively correlated if the $u_{it}$ are uncorrelated. When $T = 2$, $\ddot{u}_{i1} = -\ddot{u}_{i2}$
for all $i$, and so there is perfect negative correlation. This finding shows that for $T = 2$ it is
pointless to use the $\ddot{u}_{it}$ to test for any kind of serial correlation pattern.

When $T \geq 3$, we can use equation (10.52) to determine if there is serial correlation in
$\{u_{it}\}$. Naturally, we use the fixed effects residuals, $\hat{\ddot{u}}_{it}$. One simplification
is obtained by applying Problem 7.4: we can ignore the estimation error in $\boldsymbol{\beta}$ in
obtaining the asymptotic distribution of any test statistic based on sample covariances and variances.
In other words, it is as if we are using the $\ddot{u}_{it}$, rather than the $\hat{\ddot{u}}_{it}$. The
test is complicated by the fact that the $\{\ddot{u}_{it}\}$ are serially correlated under the null
hypothesis. There are two simple possibilities for dealing with this. First, we can just use any two
time periods (say, the last two), to test equation (10.52) using a simple regression. In other words,
run the regression

$$\hat{\ddot{u}}_{iT} \text{ on } \hat{\ddot{u}}_{i,T-1}, \qquad i = 1, \ldots, N,$$

and use $\hat{\delta}$, the coefficient on $\hat{\ddot{u}}_{i,T-1}$, along with its standard error, to
test $H_0: \delta = -1/(T - 1)$, where $\delta = \operatorname{Corr}(\ddot{u}_{i,T-1}, \ddot{u}_{iT})$.
Under Assumptions FE.1 to FE.3, the usual $t$ statistic has an asymptotic normal distribution. (It is
trivial to make this test robust to heteroskedasticity.)

Alternatively, we can use more time periods if we make the $t$ statistic robust to arbitrary serial
correlation. In other words, run the pooled OLS regression

$$\hat{\ddot{u}}_{it} \text{ on } \hat{\ddot{u}}_{i,t-1}, \qquad t = 3, \ldots, T;\; i = 1, \ldots, N,$$

and use the fully robust standard error for pooled OLS; see equation (7.26). It may seem a little odd
that we make a test for serial correlation robust to serial correlation, but this need arises because
the null hypothesis is that the time-demeaned errors are serially correlated. This approach clearly
does not produce an optimal test against, say, AR(1) correlation in the $u_{it}$, but it is very simple
and may be good enough to indicate a problem.

Without Assumption FE.3, the asymptotic variance of
$\sqrt{N}(\hat{\boldsymbol{\beta}}_{FE} - \boldsymbol{\beta})$ has the sandwich form,
$[E(\ddot{\mathbf{X}}_i'\ddot{\mathbf{X}}_i)]^{-1}E(\ddot{\mathbf{X}}_i'\mathbf{u}_i\mathbf{u}_i'\ddot{\mathbf{X}}_i)[E(\ddot{\mathbf{X}}_i'\ddot{\mathbf{X}}_i)]^{-1}$.
Therefore, if we suspect or find evidence of serial correlation, we should, at a minimum, compute a
fully robust variance matrix estimator (and corresponding test statistics) for the FE estimator. But
this is just as in Chapter 7 applied to the time-demeaned set of equations. Let
$\hat{\ddot{\mathbf{u}}}_i = \ddot{\mathbf{y}}_i - \ddot{\mathbf{X}}_i\hat{\boldsymbol{\beta}}_{FE}$,
$i = 1, 2, \ldots, N$, denote the $T \times 1$ vectors of FE residuals. Direct application of equation
(7.28) gives

$$\widehat{\operatorname{Avar}}(\hat{\boldsymbol{\beta}}_{FE}) = (\ddot{\mathbf{X}}'\ddot{\mathbf{X}})^{-1}\left(\sum_{i=1}^{N}\ddot{\mathbf{X}}_i'\hat{\ddot{\mathbf{u}}}_i\hat{\ddot{\mathbf{u}}}_i'\ddot{\mathbf{X}}_i\right)(\ddot{\mathbf{X}}'\ddot{\mathbf{X}})^{-1}, \tag{10.59}$$

which was suggested by Arellano (1987) and follows from the general results of White (1984, Chap. 6).
The robust variance matrix estimator is valid in the presence of any heteroskedasticity or serial
correlation in $\{u_{it}: t = 1, \ldots, T\}$, provided that $T$ is small relative to $N$. (Remember,
equation (7.28) is generally justified for fixed $T$, $N \to \infty$ asymptotics.) The robust standard
errors are obtained as the square roots of the diagonal elements of the matrix (10.59), and matrix
(10.59) can be used as the $\hat{\mathbf{V}}$ matrix in constructing Wald statistics. Unfortunately,
the sum of squared residuals form of the $F$ statistic is no longer asymptotically valid when
Assumption FE.3 fails.
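Expression (10.59) takes only a few lines to compute. The following Python sketch assumes NumPy and a balanced panel stacked by unit and then time; the function name `fe_robust_vcov`, the simulated heteroskedastic errors, and the parameter values are illustrative assumptions. No finite-sample scaling of the robust matrix is applied, although some software packages do apply one.

```python
import numpy as np

def fe_robust_vcov(y, X, N, T):
    """FE estimates with the robust variance matrix (10.59) and the usual one (10.54)."""
    K = X.shape[1]
    Xd = X.reshape(N, T, K)
    Xd = Xd - Xd.mean(axis=1, keepdims=True)            # time-demeaned regressors
    yd = y.reshape(N, T)
    yd = yd - yd.mean(axis=1, keepdims=True)            # time-demeaned outcome
    XtX = np.einsum("itk,itl->kl", Xd, Xd)
    beta = np.linalg.solve(XtX, np.einsum("itk,it->k", Xd, yd))
    u = yd - Xd @ beta                                  # N x T matrix of FE residuals
    s = np.einsum("itk,it->ik", Xd, u)                  # row i holds X_i' u_i
    bread = np.linalg.inv(XtX)
    V_robust = bread @ (s.T @ s) @ bread                # sandwich form (10.59)
    sigma2_u = (u ** 2).sum() / (N * (T - 1) - K)       # equation (10.56)
    return beta, V_robust, sigma2_u * bread

# Hypothetical data with heteroskedastic idiosyncratic errors.
rng = np.random.default_rng(3)
N, T, K = 300, 4, 2
beta_true = np.array([1.0, -0.5])
c = rng.normal(size=(N, 1))
X = rng.normal(size=(N, T, K)) + c[:, :, None]
u = rng.normal(size=(N, T)) * (1.0 + 0.5 * np.abs(X[..., 0]))
y = X @ beta_true + c + u

beta_hat, V_robust, V_usual = fe_robust_vcov(y.ravel(), X.reshape(N * T, K), N, T)
robust_se = np.sqrt(np.diag(V_robust))
usual_se = np.sqrt(np.diag(V_usual))
```

Because each unit contributes one term $\ddot{\mathbf{X}}_i'\hat{\ddot{\mathbf{u}}}_i$ to the middle matrix, the estimator allows arbitrary within-unit correlation and heteroskedasticity but relies on having many units.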


Example 10.5 (continued): We now report the robust standard errors for the log(scrap) equation along
with the usual FE standard errors (usual in parentheses, robust in brackets):

$$\widehat{\log(scrap)} = \underset{\substack{(.109)\\ [.096]}}{-.080}\,d88 \;\underset{\substack{(.133)\\ [.193]}}{-\,.247}\,d89 \;\underset{\substack{(.151)\\ [.140]}}{-\,.252}\,grant \;\underset{\substack{(.210)\\ [.276]}}{-\,.422}\,grant_{-1}.$$

The robust standard error on $grant$ is actually smaller than the usual standard error, while the
robust standard error on $grant_{-1}$ is larger than the usual one. As a result, the absolute value of
the $t$ statistic on $grant_{-1}$ drops from about 2 to just over 1.5.


Remember, with fixed $T$ as $N \to \infty$, the robust standard errors are just as valid asymptotically
as the nonrobust ones when Assumptions FE.1 to FE.3 hold. But the usual standard errors and test
statistics may be better behaved under Assumptions FE.1 to FE.3 if $N$ is not very large relative to
$T$, especially if $u_{it}$ is normally distributed.


10.5.5 Fixed Effects Generalized Least Squares 


Recall that Assumption FE.3 can fail for two reasons. The first is that the conditional variance matrix
does not equal the unconditional variance matrix:
$E(\mathbf{u}_i\mathbf{u}_i' \mid \mathbf{x}_i, c_i) \neq E(\mathbf{u}_i\mathbf{u}_i')$. Even if
$E(\mathbf{u}_i\mathbf{u}_i' \mid \mathbf{x}_i, c_i) = E(\mathbf{u}_i\mathbf{u}_i')$, the unconditional
variance matrix may not be scalar: $E(\mathbf{u}_i\mathbf{u}_i') \neq \sigma_u^2\mathbf{I}_T$, which
means either that the variance of $u_{it}$ changes with $t$ or, probably more important, that there is
serial correlation in the idiosyncratic errors. The robust variance matrix (10.59) is valid in any
case.

Rather than compute a robust variance matrix for the FE estimator, we can instead relax Assumption FE.3
to allow for an unrestricted, albeit constant, conditional covariance matrix. This is a natural route
to follow if the robust standard errors of the FE estimator are too large to be useful and if there is
evidence of serial dependence or a time-varying variance in the $u_{it}$.


ASSUMPTION FEGLS.3: $E(\mathbf{u}_i\mathbf{u}_i' \mid \mathbf{x}_i, c_i) = \boldsymbol{\Lambda}$, a
$T \times T$ positive definite matrix.


Under Assumption FEGLS.3,
$E(\ddot{\mathbf{u}}_i\ddot{\mathbf{u}}_i' \mid \ddot{\mathbf{x}}_i) = E(\ddot{\mathbf{u}}_i\ddot{\mathbf{u}}_i')$.
Further, using $\ddot{\mathbf{u}}_i = \mathbf{Q}_T\mathbf{u}_i$,

$$E(\ddot{\mathbf{u}}_i\ddot{\mathbf{u}}_i') = \mathbf{Q}_T E(\mathbf{u}_i\mathbf{u}_i')\mathbf{Q}_T = \mathbf{Q}_T\boldsymbol{\Lambda}\mathbf{Q}_T, \tag{10.60}$$

which has rank $T - 1$. The deficient rank in expression (10.60) causes problems for the usual approach
to GLS, because the variance matrix cannot be inverted. One way to proceed is to use a generalized
inverse. A much easier approach, and one that turns out to be algebraically identical, is to drop one
of the time periods from the analysis. It can be shown (see Im, Ahn, Schmidt, and Wooldridge, 1999)
that it does not matter which of these time periods is dropped: the resulting GLS estimator is the
same.
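For concreteness, suppose we drop time period $T$, leaving the equations

$$\begin{aligned} \ddot{y}_{i1} &= \ddot{\mathbf{x}}_{i1}\boldsymbol{\beta} + \ddot{u}_{i1} \\ &\;\;\vdots \\ \ddot{y}_{i,T-1} &= \ddot{\mathbf{x}}_{i,T-1}\boldsymbol{\beta} + \ddot{u}_{i,T-1}. \end{aligned} \tag{10.61}$$

So that we do not have to introduce new notation, we write the system (10.61) as equation (10.49), with
the understanding that now $\ddot{\mathbf{y}}_i$ is $(T - 1) \times 1$, $\ddot{\mathbf{X}}_i$ is
$(T - 1) \times K$, and $\ddot{\mathbf{u}}_i$ is $(T - 1) \times 1$. Define the $(T - 1) \times (T - 1)$
positive definite matrix $\boldsymbol{\Omega} \equiv E(\ddot{\mathbf{u}}_i\ddot{\mathbf{u}}_i')$. We do not
need to make the dependence of $\boldsymbol{\Omega}$ on $\boldsymbol{\Lambda}$ and $\mathbf{Q}_T$
explicit; the key point is that, if no restrictions are made on $\boldsymbol{\Lambda}$, then
$\boldsymbol{\Omega}$ is also unrestricted.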

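To estimate $\boldsymbol{\Omega}$, we estimate $\boldsymbol{\beta}$ by fixed effects in the first stage.
After dropping the last time period for each $i$, define the $(T - 1) \times 1$ residuals
$\hat{\ddot{\mathbf{u}}}_i = \ddot{\mathbf{y}}_i - \ddot{\mathbf{X}}_i\hat{\boldsymbol{\beta}}_{FE}$,
$i = 1, 2, \ldots, N$. A consistent estimator of $\boldsymbol{\Omega}$ is

$$\hat{\boldsymbol{\Omega}} = N^{-1}\sum_{i=1}^{N}\hat{\ddot{\mathbf{u}}}_i\hat{\ddot{\mathbf{u}}}_i'. \tag{10.62}$$

The fixed effects GLS (FEGLS) estimator is defined by

$$\hat{\boldsymbol{\beta}}_{FEGLS} = \left(\sum_{i=1}^{N}\ddot{\mathbf{X}}_i'\hat{\boldsymbol{\Omega}}^{-1}\ddot{\mathbf{X}}_i\right)^{-1}\left(\sum_{i=1}^{N}\ddot{\mathbf{X}}_i'\hat{\boldsymbol{\Omega}}^{-1}\ddot{\mathbf{y}}_i\right),$$

where $\ddot{\mathbf{X}}_i$ and $\ddot{\mathbf{y}}_i$ are defined with the last time period dropped. For
consistency of FEGLS, we replace Assumption FE.2 with a new rank condition:


ASSUMPTION FEGLS.2: $\operatorname{rank} E(\ddot{\mathbf{X}}_i'\boldsymbol{\Omega}^{-1}\ddot{\mathbf{X}}_i) = K$.


Under Assumptions FE.1 and FEGLS.2, the FEGLS estimator is consistent. When we add Assumption FEGLS.3,
the asymptotic variance is easy to estimate:

$$\widehat{\operatorname{Avar}}(\hat{\boldsymbol{\beta}}_{FEGLS}) = \left(\sum_{i=1}^{N}\ddot{\mathbf{X}}_i'\hat{\boldsymbol{\Omega}}^{-1}\ddot{\mathbf{X}}_i\right)^{-1}.$$

The sum of squared residual statistics from FGLS can be used to test multiple restrictions. Note that
$G = T - 1$ in the $F$ statistic in equation (7.57).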
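The FEGLS steps (first-stage FE, dropping one period, estimating $\boldsymbol{\Omega}$ as in (10.62), then GLS on the remaining $T - 1$ demeaned equations) can be sketched in Python as follows. NumPy is assumed; the function name `fegls`, its `drop` argument, and the simulated AR(1) errors with coefficient 0.6 are illustrative assumptions. Running it with two different dropped periods gives a numerical check of the invariance result cited from Im, Ahn, Schmidt, and Wooldridge (1999).

```python
import numpy as np

def fegls(y, X, drop):
    """FEGLS for a balanced panel. y is N x T, X is N x T x K, drop is the period to omit."""
    N, T, K = X.shape
    Xd = X - X.mean(axis=1, keepdims=True)               # time demeaning uses all T periods
    yd = y - y.mean(axis=1, keepdims=True)
    beta_fe = np.linalg.solve(np.einsum("itk,itl->kl", Xd, Xd),
                              np.einsum("itk,it->k", Xd, yd))
    keep = [t for t in range(T) if t != drop]             # drop one period after demeaning
    Xk, yk = Xd[:, keep, :], yd[:, keep]
    u = yk - Xk @ beta_fe                                 # N x (T-1) first-stage FE residuals
    Omega_hat = u.T @ u / N                               # equation (10.62)
    Oinv = np.linalg.inv(Omega_hat)
    A = np.einsum("itk,ts,isl->kl", Xk, Oinv, Xk)         # sum_i X_i' Omega^-1 X_i
    b = np.einsum("itk,ts,is->k", Xk, Oinv, yk)           # sum_i X_i' Omega^-1 y_i
    return np.linalg.solve(A, b), np.linalg.inv(A)        # estimate and nonrobust Avar

# Hypothetical data with AR(1) idiosyncratic errors.
rng = np.random.default_rng(4)
N, T, K = 400, 4, 2
beta_true = np.array([1.0, -0.5])
c = rng.normal(size=(N, 1))
X = rng.normal(size=(N, T, K)) + c[:, :, None]
e = rng.normal(size=(N, T))
u = np.empty((N, T))
u[:, 0] = e[:, 0]
for t in range(1, T):
    u[:, t] = 0.6 * u[:, t - 1] + e[:, t]
y = X @ beta_true + c + u

beta_last, V_last = fegls(y, X, drop=T - 1)
beta_first, V_first = fegls(y, X, drop=0)
```

The two calls should agree up to rounding error, both in the coefficient estimates and in the estimated variance matrix. The sketch needs $N$ comfortably larger than $T - 1$ so that $\hat{\boldsymbol{\Omega}}$ is invertible.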

The FEGLS estimator was proposed by Kiefer (1980) when the $c_i$ are treated as parameters. As we just
showed, the procedure consistently estimates $\boldsymbol{\beta}$ when we view $c_i$ as random and allow
it to be arbitrarily correlated with $\mathbf{x}_{it}$.

The FEGLS estimator is asymptotically no less efficient than the FE estimator under Assumption FEGLS.3,
even when $\boldsymbol{\Lambda} = \sigma_u^2\mathbf{I}_T$. Generally, if
$\boldsymbol{\Lambda} \neq \sigma_u^2\mathbf{I}_T$, FEGLS is more efficient than FE, but this conclusion
relies on the large-$N$, fixed-$T$ asymptotics. Unfortunately, because FEGLS still uses the fixed
effects transformation to remove $c_i$, it can have large asymptotic standard errors if the matrices
$\ddot{\mathbf{X}}_i$ have columns close to zero.

Rather than allowing $\boldsymbol{\Omega}$ to be an unrestricted matrix, we can impose restrictions on
$\boldsymbol{\Lambda}$ that imply $\boldsymbol{\Omega}$ has a restricted form. For example, Bhargava,
Franzini, and Narendranathan (1982) (BFN) assume that $\{u_{it}\}$ follows a stable, homoskedastic
AR(1) model. This assumption implies that $\boldsymbol{\Omega}$ depends on only three parameters,
$\sigma_c^2$, $\sigma_u^2$, and the AR coefficient, $\rho$, no matter how large $T$ is. BFN obtain a
transformation that eliminates the unobserved effect, $c_i$, and removes the serial correlation in
$u_{it}$. They also propose estimators of $\rho$, so that feasible GLS is possible.

Modeling $\{u_{it}\}$ as a specific time series process is attractive when $N$ is not very large
relative to $T$, as estimating an unrestricted covariance matrix for $\ddot{\mathbf{u}}_i$ (the
$(T - 1) \times 1$ vector of time-demeaned errors) without large $N$ can lead to poor finite-sample
performance of the FGLS estimator. However, the only general statements we can make concern fixed-$T$,
$N \to \infty$ asymptotics. In this scenario, the FGLS estimator that uses unrestricted
$\hat{\boldsymbol{\Omega}}$ is no less asymptotically efficient than an FGLS estimator that puts
restrictions on $\hat{\boldsymbol{\Omega}}$. And, if the restrictions on $\boldsymbol{\Omega}$ are
incorrect, the estimator that imposes the restrictions is less asymptotically efficient. Therefore, on
purely theoretical grounds, we prefer an estimator of the type in equation (10.62).

As with any FGLS estimator, it is always a good idea to compute a fully robust
variance matrix estimator for the FEGLS estimator. The robust variance matrix
estimator is still given by equation (7.52), but now we insert the time-demeaned
regressors and FEGLS residuals in place of $X_i$ and $\hat{u}_i$, respectively. Therefore, the
fully robust variance matrix estimator for the FEGLS estimator is

$$\widehat{\mathrm{Avar}}(\hat{\beta}_{FEGLS}) = \left( \sum_{i=1}^{N} \ddot{X}_i' \hat{\Omega}^{-1} \ddot{X}_i \right)^{-1} \left( \sum_{i=1}^{N} \ddot{X}_i' \hat{\Omega}^{-1} \hat{u}_i \hat{u}_i' \hat{\Omega}^{-1} \ddot{X}_i \right) \left( \sum_{i=1}^{N} \ddot{X}_i' \hat{\Omega}^{-1} \ddot{X}_i \right)^{-1},$$

where $\hat{u}_i = \ddot{y}_i - \ddot{X}_i \hat{\beta}_{FEGLS}$ are the $(T-1) \times 1$ vectors of FEGLS residuals. If $\hat{\Omega}$
is unrestricted, the robust variance matrix estimator is robust to system
heteroskedasticity, which is generally present if $E(\ddot{u}_i \ddot{u}_i' \mid \ddot{X}_i) \neq E(\ddot{u}_i \ddot{u}_i')$. Remember, even if
we allow the unconditional variance matrix of $\ddot{u}_i$ to be unrestricted, we can never
guarantee that the conditional variance matrix does not depend on the regressors.
When $\hat{\Omega}$ is restricted, such as in the AR(1) model for $\{u_{it}\}$, the robust estimator is
also robust to the restrictions imposed being incorrect. Using an AR(1) model for the
idiosyncratic errors might be substantially more efficient than just using the usual FE
estimator, and we can guard against incorrect asymptotic inference by using a fully
robust variance matrix estimator. As we discussed earlier, if $T$ is somewhat large, for
better small-sample properties we may wish to impose restrictions on $\hat{\Omega}$ rather than
use an unrestricted estimator.

Baltagi and Li (1991) consider FEGLS estimation when $\{u_{it}\}$ follows an AR(1)
process. But their estimator of $\rho$ is based on the FE residuals (say, from the
regression of $\hat{\ddot{u}}_{it}$ on $\hat{\ddot{u}}_{i,t-1}$, $t = 2, \ldots, T$), and this estimator is inconsistent
for fixed $T$ with $N \to \infty$. In fact, if $\{u_{it}\}$ is serially uncorrelated, we know from
equation (10.52) that plim$(\hat{\rho}) = -1/(T-1)$. With large $T$ the inconsistency may be
small, but what are the consequences generally of using an inconsistent estimator of
$\rho$? We know by now that using an inconsistent estimator of a variance matrix does not
lead to inconsistency of a feasible GLS estimator, and this is no exception. But, of
course, inference should be made fully robust, because if plim$(\hat{\rho}) \neq \rho$, GLS does not
fully eliminate the serial correlation in $\{u_{it}\}$. Even if plim$(\hat{\rho})$ is "close" to $\rho$, we
may want to guard against more general forms of serial correlation or violation of
system homoskedasticity.
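
The limit $-1/(T-1)$ can be checked from the second moments of the time-demeaned errors. The sketch below assumes that the $u_{it}$ are serially uncorrelated with common variance $\sigma_u^2$, and it works with the population errors $\ddot{u}_{it}$ rather than the FE residuals, which is harmless for a probability limit as $N \to \infty$.

```latex
\begin{aligned}
E(\ddot{u}_{it}^2) &= E[(u_{it} - \bar{u}_i)^2]
  = \sigma_u^2 - \tfrac{2}{T}\sigma_u^2 + \tfrac{1}{T}\sigma_u^2
  = \sigma_u^2 (1 - 1/T), \\
E(\ddot{u}_{it}\ddot{u}_{i,t-1}) &= 0 - \tfrac{1}{T}\sigma_u^2 - \tfrac{1}{T}\sigma_u^2 + \tfrac{1}{T}\sigma_u^2
  = -\sigma_u^2 / T, \\
\frac{\sum_{t=2}^{T} E(\ddot{u}_{it}\ddot{u}_{i,t-1})}{\sum_{t=2}^{T} E(\ddot{u}_{i,t-1}^2)}
  &= \frac{-\sigma_u^2 / T}{\sigma_u^2 (T-1)/T} = -\frac{1}{T-1}.
\end{aligned}
```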


10.5.6 Using Fixed Effects Estimation for Policy Analysis 


There are other ways to interpret the FE transformation to illustrate why fixed effects
is useful for policy analysis and program evaluation. Consider the model

$$y_{it} = x_{it}\beta + v_{it} = z_{it}\gamma + \delta w_{it} + v_{it},$$

where $v_{it}$ may or may not contain an unobserved effect. Let $w_{it}$ be the policy variable
of interest; it could be continuous or discrete. The vector $z_{it}$ contains other controls
that might be correlated with $w_{it}$, including time-period dummy variables.

As an exercise, you can show that sufficient for consistency of fixed effects, along
with the rank condition FE.2, is

$$E[x_{it}'(v_{it} - \bar{v}_i)] = 0, \quad t = 1, 2, \ldots, T.$$

This assumption shows that each element of $x_{it}$, and in particular the policy variable
$w_{it}$, can be correlated with $\bar{v}_i$. What FE requires for consistency is that $w_{it}$ be
uncorrelated with deviations of $v_{it}$ from the average over the time period. So a policy
variable, such as program participation, can be systematically related to the persistent
component in the error $v_{it}$ as measured by $\bar{v}_i$. It is for this reason that FE is often
superior to pooled OLS or random effects for applications where participation in a
program is determined by preprogram attributes that also affect $y_{it}$.


10.6 First Differencing Methods 


10.6.1 Inference 


In Section 10.1 we used differencing to eliminate the unobserved effect $c_i$ with $T = 2$.
We now study the differencing transformation in the general case of model (10.41).
For completeness, we state the first assumption as follows.


ASSUMPTION FD.1: Same as Assumption FE.1.


We emphasize that the model and the interpretation of $\beta$ are exactly as in Section
10.5. What differs is our method for estimating $\beta$.

Lagging the model (10.41) one period and subtracting gives

$$\Delta y_{it} = \Delta x_{it}\beta + \Delta u_{it}, \quad t = 2, 3, \ldots, T, \tag{10.63}$$

where $\Delta y_{it} = y_{it} - y_{i,t-1}$, $\Delta x_{it} = x_{it} - x_{i,t-1}$, and $\Delta u_{it} = u_{it} - u_{i,t-1}$. As with the FE
transformation, this first-differencing transformation eliminates the unobserved effect
$c_i$. In differencing we lose the first time period for each cross section: we now have
$T - 1$ time periods for each $i$, rather than $T$. If we start with $T = 2$, then, after
differencing, we arrive at one time period for each cross section: $\Delta y_{i2} = \Delta x_{i2}\beta + \Delta u_{i2}$.
Equation (10.63) makes it clear that the elements of $x_{it}$ must be time varying (for at
least some cross section units); otherwise $\Delta x_{it}$ has elements that are identically zero
for all $i$ and $t$. Also, while the intercept in the original equation gets differenced away,
equation (10.63) contains changes in time dummies if $x_{it}$ contains time dummies. In
the $T = 2$ case, the coefficient on the second-period time dummy becomes the
intercept in the differenced equation. If we difference the general equation (10.43) we get


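$$\Delta y_{it} = \theta_2(\Delta d2_t) + \cdots + \theta_T(\Delta dT_t) + (\Delta d2_t) z_i \gamma_2 + \cdots + (\Delta dT_t) z_i \gamma_T + \Delta w_{it}\delta + \Delta u_{it}. \tag{10.64}$$


The parameters $\theta_1$ and $\gamma_1$ are not identified because they disappear from the
transformed equation, just as with fixed effects.

The first-difference (FD) estimator, $\hat{\beta}_{FD}$, is the pooled OLS estimator from the
regression

$$\Delta y_{it} \text{ on } \Delta x_{it}, \quad t = 2, \ldots, T; \; i = 1, 2, \ldots, N. \tag{10.65}$$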


Under Assumption FD.1, pooled OLS estimation of the first-differenced equations
will be consistent because

$$E(\Delta x_{it}' \Delta u_{it}) = 0, \quad t = 2, 3, \ldots, T. \tag{10.66}$$

Therefore, Assumption POLS.1 from Section 7.8 holds. In fact, strict exogeneity
holds in the FD equation:

$$E(\Delta u_{it} \mid \Delta x_{i2}, \Delta x_{i3}, \ldots, \Delta x_{iT}) = 0, \quad t = 2, 3, \ldots, T,$$

which means the FD estimator is actually unbiased conditional on $X$.

To arrive at assumption (10.66) we clearly can get by with an assumption weaker
than Assumption FD.1. The key point is that assumption (10.66) fails if $u_{it}$ is
correlated with $x_{i,t-1}$, $x_{it}$, or $x_{i,t+1}$, and so we just assume that $x_{is}$ is uncorrelated
with $u_{it}$ for all $t$ and $s$.
For completeness, we state the rank condition for the FD estimator: 


ASSUMPTION FD.2: rank $\left( \sum_{t=2}^{T} E(\Delta x_{it}' \Delta x_{it}) \right) = K$.


Assumption FD.2 clearly rules out explanatory variables in $x_{it}$ that are fixed across
time for all $i$. It also excludes perfect collinearity among the time-varying variables
after differencing. There are subtle ways in which perfect collinearity can arise in $\Delta x_{it}$.
For example, suppose that in a panel of individuals working in every year, we collect
information on labor earnings, workforce experience ($exper_{it}$), and other variables.
Then, because experience increases by one each year for every person in the sample,
$\Delta exper_{it} = 1$ for all $i$ and $t$. If $x_{it}$ also contains a full set of year dummies, then
$\Delta exper_{it}$ is perfectly collinear with the changes in the year dummies. This is easiest to
see in the $T = 2$ case, where $\Delta d2_t = 1$ for $t = 2$, and so $\Delta exper_{i2} = \Delta d2_2$ (remember,
we only use the second time period in estimation). In the general case, it can be seen
that for all $t = 2, \ldots, T$, $\Delta d2_t + 2\Delta d3_t + \cdots + (T-1)\Delta dT_t = 1$, and so $\Delta exper_{it}$ is
perfectly collinear with $\Delta d2_t, \Delta d3_t, \ldots, \Delta dT_t$.
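
The general-case identity is easy to verify mechanically. The following sketch is purely illustrative (the function names are invented for this example): it builds the changes in the time dummies for $t = 2, \ldots, T$ and forms the weighted sum with weights $1, 2, \ldots, T-1$.

```python
def time_dummy_changes(T):
    """Rows are t = 2..T; columns are the changes in d2_t, ..., dT_t."""
    rows = []
    for t in range(2, T + 1):
        row = []
        for s in range(2, T + 1):
            d_now = 1 if t == s else 0       # ds_t
            d_prev = 1 if t - 1 == s else 0  # ds_{t-1}
            row.append(d_now - d_prev)
        rows.append(row)
    return rows

def weighted_sum(T):
    """For each t, compute sum over s = 2..T of (s - 1) * (change in ds_t)."""
    return [
        sum((s - 1) * row[s - 2] for s in range(2, T + 1))
        for row in time_dummy_changes(T)
    ]
```

Because the weighted sum equals one in every period, any regressor whose first difference is constant, such as experience for continuously employed workers, lies in the span of the differenced dummies.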

Assuming the data have been ordered as we discussed earlier, first differencing
is easy to implement provided we keep track of which transformed observations
are valid and which are not. Differences for observation numbers $1, T+1, 2T+1,
3T+1, \ldots,$ and $(N-1)T+1$ should be set to missing. These observations
correspond to the first time period for every cross section unit in the original data set; by
definition, there is no first difference for the $t = 1$ observations. A little care is needed
so that differences between the first time period for unit $i+1$ and the last time period
for unit $i$ are not treated as valid observations. Making sure these are set to missing
is easy when a year variable or time period dummies have been included in the data
set.
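
A minimal sketch of this bookkeeping follows. It assumes the data are stacked by unit and then by time, and that a unit identifier is available alongside each observation (an assumption of this example; a year variable would serve equally well). The function name is hypothetical.

```python
import math

def first_difference(unit, values):
    """First-difference a stacked series sorted by unit, then by time.

    The first observation of each unit has no valid difference and is
    set to NaN, so differences never span two different units.
    """
    out = []
    for n, v in enumerate(values):
        if n == 0 or unit[n] != unit[n - 1]:
            out.append(math.nan)          # t = 1 for this unit
        else:
            out.append(v - values[n - 1])
    return out
```

Keying on the unit identifier rather than on observation numbers also guards against mistakes if some units happen to have fewer than $T$ rows.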

One reason to prefer the FD estimator to the FE estimator is that FD is easier to
implement without special software. Are there statistical reasons to prefer FD to FE?
Recall that, under Assumptions FE.1-FE.3, the FE estimator is asymptotically
efficient in the class of estimators using the strict exogeneity assumption FE.1. Therefore,
the FD estimator is less efficient than FE under Assumptions FE.1-FE.3.
Assumption FE.3 is key to the efficiency of FE. It assumes homoskedasticity and no serial
correlation in $u_{it}$. Assuming that the $\{u_{it}: t = 1, 2, \ldots, T\}$ are serially uncorrelated
may be too strong. An alternative assumption is that the first differences of the
idiosyncratic errors, $\{e_{it} \equiv \Delta u_{it}, t = 2, \ldots, T\}$, are serially uncorrelated (and have
constant variance).


ASSUMPTION FD.3: $E(e_i e_i' \mid x_{i1}, \ldots, x_{iT}, c_i) = \sigma_e^2 I_{T-1}$, where $e_i$ is the $(T-1) \times 1$
vector containing $e_{it}$, $t = 2, \ldots, T$.


Under Assumption FD.3 we can write $u_{it} = u_{i,t-1} + e_{it}$, so that no serial correlation
in $e_{it}$ implies that $u_{it}$ is a random walk. A random walk has substantial serial
dependence, and so Assumption FD.3 represents an opposite extreme from Assumption
FE.3.

Under Assumptions FD.1-FD.3, it can be shown that the FD estimator is most
efficient in the class of estimators using the strict exogeneity assumption FE.1.
Further, from the pooled OLS analysis in Section 7.8,

$$\widehat{\mathrm{Avar}}(\hat{\beta}_{FD}) = \hat{\sigma}_e^2 (\Delta X' \Delta X)^{-1}, \tag{10.67}$$

where $\hat{\sigma}_e^2$ is a consistent estimator of $\sigma_e^2$. The simplest estimator is obtained by
computing the OLS residuals

$$\hat{e}_{it} = \Delta y_{it} - \Delta x_{it}\hat{\beta}_{FD} \tag{10.68}$$

from the pooled regression (10.65). A consistent estimator of $\sigma_e^2$ is

$$\hat{\sigma}_e^2 = [N(T-1) - K]^{-1} \sum_{i=1}^{N} \sum_{t=2}^{T} \hat{e}_{it}^2, \tag{10.69}$$


which is the usual error variance estimator from regression (10.65). These equations 
show that, under Assumptions FD.1—FD.3, the usual OLS standard errors from the 
FD regression (10.65) are asymptotically valid. 

Unlike in the FE regression (10.48), the denominator in equation (10.69) is cor- 
rectly obtained from regression (10.65). Dropping the first time period appropriately 
captures the lost degrees of freedom (N of them). 

Under Assumption FD.3, all statistics reported from the pooled regression on the 
first-differenced data are asymptotically valid, including F statistics based on sums of 
squared residuals. 


10.6.2 Robust Variance Matrix 


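If Assumption FD.3 is violated, then, as usual, we can compute a robust variance
matrix. The estimator in equation (7.26) applied in this context is

$$\widehat{\mathrm{Avar}}(\hat{\beta}_{FD}) = (\Delta X' \Delta X)^{-1} \left( \sum_{i=1}^{N} \Delta X_i' \hat{e}_i \hat{e}_i' \Delta X_i \right) (\Delta X' \Delta X)^{-1}, \tag{10.70}$$

where $\Delta X$ denotes the $N(T-1) \times K$ matrix of stacked first differences of $x_{it}$.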
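
The FD estimator and both variance estimators can be computed together. The NumPy sketch below is an illustration only, assuming a balanced panel stored as arrays `y` (N x T) and `X` (N x T x K); the function name `fd_estimator` is invented here. It returns the pooled OLS coefficients from regression (10.65), the usual variance matrix from (10.67) and (10.69), and the robust matrix from (10.70).

```python
import numpy as np

def fd_estimator(y, X):
    """First-difference estimator for a balanced panel.

    y : (N, T) array; X : (N, T, K) array.
    Returns (beta_fd, avar_usual, avar_robust).
    """
    N, T, K = X.shape
    dy = np.diff(y, axis=1)              # (N, T-1): periods t = 2..T
    dX = np.diff(X, axis=1)              # (N, T-1, K)
    DX = dX.reshape(N * (T - 1), K)      # stacked first differences
    Dy = dy.reshape(N * (T - 1))

    XtX_inv = np.linalg.inv(DX.T @ DX)
    beta = XtX_inv @ (DX.T @ Dy)         # pooled OLS on differenced data

    e = dy - dX @ beta                   # (N, T-1) residuals, eq. (10.68)
    sigma2_e = (e ** 2).sum() / (N * (T - 1) - K)   # eq. (10.69)
    avar_usual = sigma2_e * XtX_inv      # eq. (10.67), valid under FD.3

    # Row i of `scores` is (Delta X_i)' e_i, so the outer-product sum in
    # (10.70) is scores' scores.
    scores = np.einsum("itk,it->ik", dX, e)
    avar_robust = XtX_inv @ (scores.T @ scores) @ XtX_inv
    return beta, avar_usual, avar_robust
```

The robust matrix allows arbitrary serial correlation and heteroskedasticity within each unit, because the residuals enter only through the unit-level sums.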

Example 10.6 (FD Estimation of the Effects of Job Training Grants): We now
estimate the effect of job training grants on log(scrap) using first differencing.
Specifically, we use pooled OLS on

$$\Delta \log(scrap_{it}) = \delta_1 + \delta_2\, d89_t + \beta_1 \Delta grant_{it} + \beta_2 \Delta grant_{i,t-1} + \Delta u_{it}.$$


Rather than difference the year dummies and omit the intercept, we simply include an 
intercept and a dummy variable for 1989 to capture the aggregate time effects. If we 
were specifically interested in the year effects from the structural model (in levels), 
then we should difference those as well. 

The estimated equation is

$$\begin{array}{rcccc}
\widehat{\Delta \log(scrap)} = & -.091 & -\,.096\; d89 & -\,.223\; \Delta grant & -\,.351\; \Delta grant_{-1} \\
 & (.091) & (.125) & (.131) & (.235) \\
 & [.088] & [.111] & [.128] & [.265]
\end{array}$$

$$R^2 = .037,$$

where the usual standard errors are in parentheses and the robust standard errors are
in brackets. We report $R^2$ here because it has a useful interpretation: it measures the
amount of variation in the growth in the scrap rate that is explained by $\Delta grant$ and
$\Delta grant_{-1}$ (and $d89$). The estimates on $grant$ and $grant_{-1}$ are fairly similar to the FE
estimates, although $grant$ is now statistically more significant than $grant_{-1}$. The usual
$F$ test for joint significance of $\Delta grant$ and $\Delta grant_{-1}$ is 1.53, with $p$-value = .222.


In the previous example we used a device that is common when applying FD
estimation. Namely, rather than drop an overall intercept and include the differenced
time dummies, we estimated an intercept and then included time dummies for $T - 2$
of the remaining periods (in Example 10.6, just the third period). Generally, rather
than using the $T - 1$ regressors $(\Delta d2_t, \Delta d3_t, \ldots, \Delta dT_t)$, it is often more convenient to
use, say, $(1, d3_t, \ldots, dT_t)$. Because these sets of regressors involving the time dummies
are nonsingular linear transformations of each other, the estimated coefficients on the
other variables do not change, nor do their standard errors or any test statistics.
In most regression packages, including an overall intercept makes it easier to obtain
the appropriate R-squared for the FD equation. Of course, if one is interested in the
coefficients on the original time dummies, then it is easiest to simply include all
dummies in FD form and omit an overall intercept.


10.6.3 Testing for Serial Correlation 


Under Assumption FD.3, the errors $e_{it} \equiv \Delta u_{it}$ should be serially uncorrelated. We
can easily test this assumption given the pooled OLS residuals from regression
(10.65). Since the strict exogeneity assumption holds, we can apply the simple form of
the test in Section 7.8. The regression is based on $T - 2$ time periods:

$$\hat{e}_{it} = \hat{\rho}_1 \hat{e}_{i,t-1} + error_{it}, \quad t = 3, 4, \ldots, T; \; i = 1, 2, \ldots, N. \tag{10.71}$$

The test statistic is the usual $t$ statistic on $\hat{\rho}_1$. With $T = 2$ this test is not available, nor
is it necessary. With $T = 3$, regression (10.71) is just a cross section regression
because we lose the $t = 1$ and $t = 2$ time periods.

If the idiosyncratic errors $\{u_{it}: t = 1, 2, \ldots, T\}$ are uncorrelated to begin with,
$\{e_{it}: t = 2, 3, \ldots, T\}$ will be autocorrelated. In fact, under Assumption FE.3 it is easily
shown that Corr$(e_{it}, e_{i,t-1}) = -.5$. In any case, a finding of significant serial
correlation in the $e_{it}$ warrants computing the robust variance matrix for the FD estimator.
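
For completeness, the value $-.5$ follows in three lines once the $u_{it}$ are taken to be serially uncorrelated with common variance $\sigma_u^2$, as Assumption FE.3 implies:

```latex
\begin{aligned}
\operatorname{Var}(e_{it}) &= \operatorname{Var}(u_{it} - u_{i,t-1}) = 2\sigma_u^2, \\
\operatorname{Cov}(e_{it}, e_{i,t-1}) &= \operatorname{Cov}(u_{it} - u_{i,t-1},\; u_{i,t-1} - u_{i,t-2})
  = -\operatorname{Var}(u_{i,t-1}) = -\sigma_u^2, \\
\operatorname{Corr}(e_{it}, e_{i,t-1}) &= \frac{-\sigma_u^2}{2\sigma_u^2} = -\tfrac{1}{2}.
\end{aligned}
```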


Example 10.6 (continued): We test for AR(1) serial correlation in the first-differenced
equation by regressing $\hat{e}_{it}$ on $\hat{e}_{i,t-1}$ using the year 1989. We get $\hat{\rho}_1 = .237$ with $t$ statistic
= 1.76. There is marginal evidence of positive serial correlation in the first differences
$\Delta u_{it}$. Further, $\hat{\rho}_1 = .237$ is very different from $\rho_1 = -.5$, which is implied by the
standard random and fixed effects assumption that the $u_{it}$ are serially uncorrelated.


An alternative to computing robust standard errors and test statistics is to use
an FDGLS analysis under the assumption that $E(e_i e_i' \mid x_i)$ is a constant $(T-1) \times
(T-1)$ matrix. We omit the details, as they are similar to the FEGLS case in Section
10.5.5. As with FEGLS, we could impose structure on $E(u_i u_i')$, such as a stable,
homoskedastic AR(1) model, and then derive $E(e_i e_i')$ in terms of a small set of parameters.


10.6.4 Policy Analysis Using First Differencing 


First differencing a structural equation with an unobserved effect is a simple yet 
powerful method of program evaluation. Many questions can be addressed by having 
a two-year panel data set with control and treatment groups available at two points 
in time. 

In applying first differencing, we should difference all variables appearing in the 
structural equation to obtain the estimating equation, including any binary indicators 
indicating participation in the program. The estimates should be interpreted in the 
original equation because it allows us to think of comparing different units in the 
cross section at any point in time, where one unit receives the treatment and the other 
does not. 

In one special case it does not matter whether the policy variable is differenced.
Assume that $T = 2$, and let $prog_{it}$ denote a binary indicator set to one if person $i$ was
in the program at time $t$. For many programs, $prog_{i1} = 0$ for all $i$: no one participated
in the program in the initial time period. In the second time period, $prog_{i2}$ is unity for
those who participate in the program and zero for those who do not. In this one case,
$\Delta prog_{i2} = prog_{i2}$, and the FD equation can be written as

$$\Delta y_{i2} = \theta_2 + \Delta z_{i2}\gamma + \delta_1\, prog_{i2} + \Delta u_{i2}. \tag{10.72}$$

The effect of the policy can be obtained by regressing the change in $y$ on the
change in $z$ and the policy indicator. When $\Delta z_{i2}$ is omitted, the estimate of $\delta_1$ from
equation (10.72) is the difference-in-differences (DD) estimator (see Problem 10.4):
$\hat{\delta}_1 = \Delta\bar{y}_{treat} - \Delta\bar{y}_{control}$. This is similar to the DD estimator from Section 6.3 (see
equation (6.53)), but there is an important difference: with panel data, the differences
over time are for the same cross section units.
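
The equivalence between the regression coefficient and the difference in mean changes is easy to confirm numerically. In the sketch below the function names are invented for illustration, and any numbers passed in are hypothetical rather than taken from the text.

```python
def ols_slope(x, y):
    """Slope from a simple OLS regression of y on x with an intercept."""
    n = len(x)
    xbar = sum(x) / n
    ybar = sum(y) / n
    sxy = sum((a - xbar) * (b - ybar) for a, b in zip(x, y))
    sxx = sum((a - xbar) ** 2 for a in x)
    return sxy / sxx

def diff_in_diff(prog, dy):
    """Mean change in y for participants minus mean change for the rest."""
    treat = [d for p, d in zip(prog, dy) if p == 1]
    control = [d for p, d in zip(prog, dy) if p == 0]
    return sum(treat) / len(treat) - sum(control) / len(control)
```

For any binary `prog` with both groups represented, `ols_slope(prog, dy)` and `diff_in_diff(prog, dy)` agree up to rounding error, which is the content of the DD result when $\Delta z_{i2}$ is omitted.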

If some people participated in the program in the first time period, or if more than
two periods are involved, equation (10.72) can give misleading answers. In general,
the equation that should be estimated is

$$\Delta y_{it} = \xi_t + \Delta z_{it}\gamma + \delta_1 \Delta prog_{it} + \Delta u_{it}, \tag{10.73}$$

where the program participation indicator is differenced along with everything else,
and the $\xi_t$ are new period intercepts. Example 10.6 is one such case. Extensions of the
model, where $prog_{it}$ appears in other forms, are discussed in Chapter 11.


10.7 Comparison of Estimators 


10.7.1 Fixed Effects versus First Differencing 


When we have only two time periods, FE estimation and FD produce identical esti- 
mates and inference, as you are asked to show in Problem 10.3. First differencing is 
easier to implement, and all procedures that can be applied to a single cross section— 
such as heteroskedasticity-robust inference—can be applied directly. 
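
A quick numerical check of the $T = 2$ equivalence is sketched below for a single regressor. To keep the example short it assumes there are no aggregate time effects, so the FD regression has no intercept; the function names and the data layout (one `(period 1, period 2)` pair per unit) are choices made for this illustration.

```python
def fe_slope(x, y):
    """Within (FE) slope; x and y are lists of (period 1, period 2) pairs."""
    num = den = 0.0
    for (x1, x2), (y1, y2) in zip(x, y):
        xbar, ybar = (x1 + x2) / 2, (y1 + y2) / 2
        for xt, yt in ((x1, y1), (x2, y2)):
            num += (xt - xbar) * (yt - ybar)
            den += (xt - xbar) ** 2
    return num / den

def fd_slope(x, y):
    """First-difference slope (no intercept) from the same data."""
    num = sum((x2 - x1) * (y2 - y1) for (x1, x2), (y1, y2) in zip(x, y))
    den = sum((x2 - x1) ** 2 for (x1, x2) in x)
    return num / den
```

With two periods each demeaned observation is plus or minus half the first difference, so the two ratios coincide.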

When $T > 2$ and we are confident the strict exogeneity assumption holds, the
choice between FD and FE hinges on the assumptions about the idiosyncratic errors,
$u_{it}$. In particular, the FE estimator is more efficient under Assumption FE.3 (the $u_{it}$
are serially uncorrelated), while the FD estimator is more efficient when $u_{it}$ follows a
random walk. In many cases, the truth is likely to lie somewhere in between.

If FE and FD estimates differ in ways that cannot be attributed to sampling error,
we should worry about violations of the strict exogeneity assumption. If $u_{it}$ is
correlated with $x_{is}$ for any $t$ and $s$, FE and FD generally have different probability
limits. Any of the standard endogeneity problems, including measurement error,
time-varying omitted variables, and simultaneity, generally cause correlation between $x_{it}$
and $u_{it}$ (that is, contemporaneous correlation), which then causes both FD and FE
to be inconsistent and to have different probability limits. (We explicitly consider
these problems in Chapter 11.) In addition, correlation between $u_{it}$ and $x_{is}$ for $s \neq t$
causes FD and FE to be inconsistent. When lagged $x_{it}$ is correlated with $u_{it}$, we can
solve lack of strict exogeneity by including lags and interpreting the equation as a
distributed lag model. More problematical is when $u_{it}$ is correlated with future $x_{it}$:
only rarely does putting future values of explanatory variables in an equation lead to
an interesting economic model. In Chapter 11 we show how to estimate the
parameters consistently when there is feedback from $u_{it}$ to $x_{is}$, $s > t$.

If we maintain contemporaneous exogeneity, that is,

$$E(x_{it}' u_{it}) = 0, \tag{10.74}$$


then we can show that the FE estimator generally has an inconsistency that shrinks
to zero at the rate $1/T$, while the inconsistency of the FD estimator is essentially
independent of $T$. More precisely, we can find the probability limits of the FE and
FD estimators (with fixed $T$ and $N \to \infty$) and then study the plims as a function of
$T$. The inconsistency in these estimators is sometimes, rather loosely, called the
"asymptotic bias."

Generally, under Assumption FE.1, we can write

$$\underset{N \to \infty}{\mathrm{plim}}\,(\hat{\beta}_{FE}) = \beta + \left[ T^{-1} \sum_{t=1}^{T} E(\ddot{x}_{it}' \ddot{x}_{it}) \right]^{-1} \left[ T^{-1} \sum_{t=1}^{T} E(\ddot{x}_{it}' u_{it}) \right], \tag{10.75}$$

where $\ddot{x}_{it} = x_{it} - \bar{x}_i$, as always, and we emphasize that we are taking the plim
for fixed $T$ and $N \to \infty$. Under (10.74), $E(\ddot{x}_{it}' u_{it}) = E[(x_{it} - \bar{x}_i)' u_{it}] = -E(\bar{x}_i' u_{it})$,
and so $T^{-1} \sum_{t=1}^{T} E(\ddot{x}_{it}' u_{it}) = -T^{-1} \sum_{t=1}^{T} E(\bar{x}_i' u_{it}) = -E(\bar{x}_i' \bar{u}_i)$. Now, we can easily
bound both average moments in (10.75) if we assume that the process $\{(x_{it}, u_{it}): t =
1, 2, \ldots\}$, considered as a time series, is stable and weakly dependent. Actually, as will
become clear, what we really should assume is that the time-demeaned sequence,
$\{\ddot{x}_{it}\}$, is weakly dependent. This allows for, say, covariate processes such as $x_{it} =
h_i + r_{it}$, where $h_i$ is time-constant heterogeneity and $\{r_{it}\}$ is weakly dependent,
because then $\ddot{x}_{it} = \ddot{r}_{it}$; the persistent component, $h_i$, has been removed. For notational
convenience, we just assume that $\{x_{it}\}$ is weakly dependent. (See Hamilton (1994)
and Wooldridge (1994a) for general discussions of weak dependence for time series
processes. Weakly dependent processes are commonly, if somewhat misleadingly,
referred to as "stationary" processes.)

Under weak dependence of the covariate process, $T^{-1} \sum_{t=1}^{T} E(\ddot{x}_{it}' \ddot{x}_{it})$ is bounded as
a function of $T$, and so is its inverse if we strengthen Assumption FE.2 to hold
uniformly in $T$, a mild restriction that holds under the rank condition if we impose
stationarity on $\{x_{it}: t = 1, 2, \ldots\}$. Further, Var$(\bar{x}_i)$ and Var$(\bar{u}_i)$ are $O(T^{-1})$ under
weak dependence. By the Cauchy-Schwarz inequality (for example, Davidson, 1994,
Chap. 9) we have, for each $j = 1, \ldots, K$, $|E(\bar{x}_{ij}\bar{u}_i)| \leq [\mathrm{Var}(\bar{x}_{ij})\mathrm{Var}(\bar{u}_i)]^{1/2} = O(T^{-1})$.
It follows that

$$\underset{N \to \infty}{\mathrm{plim}}\,(\hat{\beta}_{FE}) = \beta + O(1) \cdot O(T^{-1}) = \beta + O(T^{-1}) \equiv \beta + r_{FE}(T), \tag{10.76}$$

where $r_{FE}(T) = O(T^{-1})$ is the inconsistency as a function of $T$. It follows that
the inconsistency in the FE estimator can be small if we have a reasonably large $T$,
although the exact size of $r_{FE}(T)$ depends on features of the stochastic process
$\{(x_{it}, u_{it}): t = 1, 2, \ldots\}$.

Hsiao (2003, Sect. 4.2) derives the exact form of $r_{FE}(T)$ (a scalar) for the stable
autoregressive model, that is, with $x_{it} = y_{i,t-1}$ and $|\beta| < 1$. The term $r_{FE}(T)$ is
negative and, for fixed $T$, increases in absolute value as $\beta$ approaches unity.
Unfortunately, finding $r_{FE}(T)$ in general means modeling $\{(x_{it}, u_{it}): t = 1, 2, \ldots\}$, and so we
cannot know how close $r_{FE}(T)$ is to zero for a given $T$ without making specific
assumptions. (In the AR(1) model, we effectively modeled the stochastic process of
the regressor and error because the former is the lagged dependent variable and the
latter is assumed to be serially uncorrelated.)

For the FD estimator, the general probability limit is

$$\underset{N \to \infty}{\mathrm{plim}}\,(\hat{\beta}_{FD}) = \beta + \left[ (T-1)^{-1} \sum_{t=2}^{T} E(\Delta x_{it}' \Delta x_{it}) \right]^{-1} \left[ (T-1)^{-1} \sum_{t=2}^{T} E(\Delta x_{it}' \Delta u_{it}) \right]. \tag{10.77}$$

If $\{x_{it}: t = 1, 2, \ldots\}$ is weakly dependent, so is $\Delta x_{it}$, and so the first average in (10.77)
is generally bounded. (In fact, under stationarity this average does not depend on $T$.)
Under (10.74),

$$E(\Delta x_{it}' \Delta u_{it}) = -[E(x_{it}' u_{i,t-1}) + E(x_{i,t-1}' u_{it})],$$

which is generally nonzero. Under stationarity, $E(\Delta x_{it}' \Delta u_{it})$ does not depend on $t$, and
so the second average in (10.77) does not depend on $T$. Even if we assume $u_{it}$ is
uncorrelated with past covariates (a sequential exogeneity assumption conditional
on $c_i$), so that $E(x_{i,t-1}' u_{it}) = 0$, $E(x_{it}' u_{i,t-1})$ does not equal zero if there is feedback
from current idiosyncratic errors to future values of the covariates. Therefore, with
sequential exogeneity, the second term is $-(T-1)^{-1} \sum_{t=2}^{T} E(x_{it}' u_{i,t-1})$, which equals
$-E(x_{i2}' u_{i1})$ under stationarity, and is generally $O(1)$ even without stationarity.

The previous analysis shows that under contemporaneous exogeneity and weak
dependence of the regressors and idiosyncratic errors, the FE estimator has an
advantage over the FD estimator when $T$ is large. Interestingly, a more detailed
analysis shows that the same order of magnitude holds for $r_{FE}(T)$ if some regressors
have unit roots in their time series representation, that is, they are "integrated of
order one," or I(1). (See Hamilton (1994) or Wooldridge (1994a) for more on I(1)
processes.) The error process must still be assumed I(0) (weakly dependent). For
example, in the scalar case, if $\{x_{it}: t = 1, 2, \ldots\}$ is I(1) without a drift term, then
$T^{-2} \sum_{t=1}^{T} E[(x_{it} - \bar{x}_i)^2] = O(1)$ and Var$(\bar{x}_i) = O(T)$. If we maintain that $\{u_{it}: t =
1, 2, \ldots\}$ is I(0), then Var$(\bar{u}_i) = O(T^{-1})$, and simple algebra shows that $r_{FE}(T) =
O(T^{-1})$. The orders of magnitude for moments of averages of I(1) series can be
inferred from Hamilton (1994, Prop. 17.1), for example, or shown directly. Harris and
Tzavalis (1999) established the $O(T^{-1})$ rate for the inconsistency for the
autoregressive model with $\beta = 1$ (in fact, they show $r_{FE}(T) = -3/(T+1)$), although the
process must be without linear trend when $\beta = 1$.

One way to summarize the $O(T^{-1})$ inconsistency for $\hat{\beta}_{FE}$ when $-1 < \beta \leq 1$ is that
it applies to the model $y_{it} = \beta y_{i,t-1} + (1 - \beta)a_i + u_{it}$, so that when $\beta = 1$, $\{y_{it}: t =
0, 1, \ldots\}$ is I(1) without drift (and therefore its mean is $E(y_{i0})$ for all $t$). The FD
estimator of $\beta$ has inconsistency $O(1)$ in such models, and more generally with I(1)
regressors if strict exogeneity fails.

The previous analysis certainly favors FE estimation when contemporaneous
exogeneity holds but strict exogeneity fails, even if $\{y_{it}\}$ and some elements of $\{x_{it}\}$ have
unit roots. Unfortunately, there is a catch: the finding that $r_{FE}(T) = O(T^{-1})$ depends
critically on the idiosyncratic errors $\{u_{it}\}$ being an I(0) sequence. In the terminology
of time series econometrics, $y_{it}$ and $x_{it}$ must be "cointegrated" (see Hamilton, 1994,
Chap. 19). If, in a time series sense, our model represents a spurious regression (that
is, there is no value of $\beta$ such that $y_{it} - x_{it}\beta - c_i$ is I(0)), then the FE approach is no
longer superior to FD when comparing inconsistencies. In fact, for fixed $N$, the
spurious regression problem for FE becomes more acute as $T$ gets large. By contrast, FD
removes any unit roots in $y_{it}$ and $x_{it}$, and so spurious regression is not an issue (but
lack of strict exogeneity might be).

When T and N are similar in magnitude, a more realistic scenario is to let T and N 
grow at the same time (and perhaps the same rate). In this scenario, convergence 
results for partial sums of I(1) processes and functions of them are needed to obtain 
limiting distribution results. Considering large T asymptotics is beyond the scope of 
this text. Phillips and Moon (1999, 2000) discuss unit roots, spurious regression, 
cointegration, and a variety of estimation and testing procedures for panel data. See 
Baltagi (2001, Chap. 12) for a summary. 

Because strict exogeneity plays such an important role in FE and FD estimation
with small $T$, it is important to have a way of formally detecting its violation. One
possibility is to directly compare $\hat{\beta}_{FE}$ and $\hat{\beta}_{FD}$ via a Hausman test. It could be
important to use a robust form of the test that maintains neither Assumption FE.3 nor
Assumption FD.3, so that neither estimator is assumed efficient under the null. The
Hausman test has no systematic power for detecting violation of the second moment
assumptions (either FE.3 or FD.3); it is consistent only against alternatives where
$E(x_{is}' u_{it}) \neq 0$ for some $(s, t)$ pairs.

Even if we assume that FE or FD is asymptotically efficient under the null, a
drawback to the traditional form of the Hausman test is that, with aggregate time
effects, the asymptotic variance of $\sqrt{N}(\hat{\beta}_{FE} - \hat{\beta}_{FD})$ is singular. In fact, in a model
with only aggregate time effects, it can be shown that the FD and FE estimators are
identical. Problem 10.6 asks you to work through the statistic in the case without
aggregate time effects. Here, we focus on regression-based tests, which are easy to
compute even with time-period dummies and easy to make fully robust.

If $T = 2$, it is easy to test for strict exogeneity. In the equation $\Delta y_i = \Delta x_i\beta + \Delta u_i$,
neither $x_{i1}$ nor $x_{i2}$ should be significant as additional explanatory variables in the FD
equation. We simply add, say, $x_{i2}$ to the FD equation and carry out an $F$ test for
significance of $x_{i2}$. With more than two time periods, a test of strict exogeneity is a
test of $H_0: \gamma = 0$ in the expanded equation

$$\Delta y_{it} = \Delta x_{it}\beta + w_{it}\gamma + \Delta u_{it}, \quad t = 2, \ldots, T,$$

where $w_{it}$ is a subset of $x_{it}$ (that would exclude time dummies). Using the Wald
approach, this test can be made robust to arbitrary serial correlation or
heteroskedasticity; under Assumptions FD.1-FD.3 the usual $F$ statistic is asymptotically
valid.

A test of strict exogeneity using fixed effects, when $T > 2$, is obtained by specifying
the equation

$$y_{it} = x_{it}\beta + w_{i,t+1}\delta + c_i + u_{it}, \quad t = 1, 2, \ldots, T-1,$$

where $w_{i,t+1}$ is again a subset of $x_{i,t+1}$. Under strict exogeneity, $\delta = 0$, and we can
carry out the test using FE estimation. (We lose the last time period by leading $w_{it}$.)
An example is given in Problem 10.12.

Under strict exogeneity, we can use a GLS procedure on either the time-demeaned
equation or the FD equation. If the variance matrix of $u_i$ is unrestricted, it does not
matter which transformation we use. Intuitively, this point is pretty clear, since
allowing $E(u_i u_i')$ to be unrestricted places no restrictions on $E(\ddot{u}_i \ddot{u}_i')$ or $E(\Delta u_i \Delta u_i')$. Im,
Ahn, Schmidt, and Wooldridge (1999) show formally that the FEGLS and FDGLS
estimators are asymptotically equivalent under Assumptions FE.1 and FEGLS.3 and
the appropriate rank conditions. Of course the system homoskedasticity assumption
$E(u_i u_i' \mid x_i, c_i) = E(u_i u_i')$ is maintained.


10.7.2 Relationship between the Random Effects and Fixed Effects Estimators 


In cases where the key variables in $x_{it}$ do not vary much over time, FE and FD
methods can lead to imprecise estimates. We may be forced to use random effects
estimation in order to learn anything about the population parameters. If an RE
analysis is appropriate (that is, if $c_i$ is orthogonal to $x_{it}$), then the RE estimators can
have much smaller variances than the FE or FD estimators. We now obtain an
expression for the RE estimator that allows us to compare it with the FE estimator.

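Using the fact that $j_T' j_T = T$, we can write $\Omega$ under the random effects structure as

$$
\begin{aligned}
\Omega &= \sigma_u^2 I_T + \sigma_c^2 j_T j_T' = \sigma_u^2 I_T + T\sigma_c^2\, j_T (j_T' j_T)^{-1} j_T' \\
&= \sigma_u^2 I_T + T\sigma_c^2 P_T = (\sigma_u^2 + T\sigma_c^2)(P_T + \eta Q_T),
\end{aligned}
$$

where $P_T \equiv I_T - Q_T = j_T (j_T' j_T)^{-1} j_T'$ and $\eta \equiv \sigma_u^2/(\sigma_u^2 + T\sigma_c^2)$. Next, define
$S_T \equiv P_T + \eta Q_T$. Then $S_T^{-1} = P_T + (1/\eta) Q_T$, as can be seen by direct matrix
multiplication. Further, $S_T^{-1/2} = P_T + (1/\sqrt{\eta}) Q_T$, because multiplying this matrix by itself
gives $S_T^{-1}$ (the matrix is clearly symmetric, since $P_T$ and $Q_T$ are symmetric). After
simple algebra, it can be shown that $S_T^{-1/2} = (1 - \lambda)^{-1} [I_T - \lambda P_T]$, where $\lambda = 1 - \sqrt{\eta}$.
Therefore,

$$\Omega^{-1/2} = (\sigma_u^2 + T\sigma_c^2)^{-1/2} (1 - \lambda)^{-1} [I_T - \lambda P_T] = (1/\sigma_u) [I_T - \lambda P_T],$$

where $\lambda = 1 - [\sigma_u^2 / (\sigma_u^2 + T\sigma_c^2)]^{1/2}$. Assume for the moment that we know $\lambda$. Then the
RE estimator is obtained by estimating the transformed equation $C_T y_i = C_T X_i \beta + C_T v_i$
by system OLS, where $C_T \equiv [I_T - \lambda P_T]$. Write the transformed equation as

$$\check{y}_i = \check{X}_i \beta + \check{v}_i. \tag{10.78}$$

The variance matrix of $\check{v}_i$ is $E(\check{v}_i \check{v}_i') = C_T \Omega C_T' = \sigma_u^2 I_T$, which verifies that $\check{v}_i$ has
variance matrix ideal for system OLS estimation.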
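
The claim that the transformation by $I_T - \lambda P_T$ turns the random effects variance matrix into $\sigma_u^2 I_T$ can be checked numerically. The following is a minimal sketch in Python with NumPy; the function names and the values chosen for $T$, $\sigma_u^2$, and $\sigma_c^2$ are illustrative assumptions, not part of the text.

```python
import numpy as np

def re_omega(T, sigma_u2, sigma_c2):
    """Random effects variance matrix: sigma_u^2 * I_T + sigma_c^2 * j_T j_T'."""
    j = np.ones((T, 1))
    return sigma_u2 * np.eye(T) + sigma_c2 * (j @ j.T)

def quasi_demeaning_matrix(T, sigma_u2, sigma_c2):
    """Return (lam, C_T), where C_T = I_T - lam * P_T and P_T = j_T (j_T' j_T)^{-1} j_T'."""
    lam = 1.0 - np.sqrt(sigma_u2 / (sigma_u2 + T * sigma_c2))
    P = np.ones((T, T)) / T
    return lam, np.eye(T) - lam * P

# Illustrative (assumed) values.
T, sigma_u2, sigma_c2 = 4, 0.5, 2.0
omega = re_omega(T, sigma_u2, sigma_c2)
lam, C = quasi_demeaning_matrix(T, sigma_u2, sigma_c2)

# Variance matrix of the transformed error, C_T * Omega * C_T'.
transformed = C @ omega @ C.T
print(np.allclose(transformed, sigma_u2 * np.eye(T)))
```

Because $C_T$ is symmetric, the transpose in the last step is only for readability.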

The $t$th element of $\check{y}_i$ is easily seen to be $y_{it} - \lambda\bar{y}_i$, and similarly for $\check{X}_i$. Therefore,
system OLS estimation of equation (10.78) is just pooled OLS estimation of

$$y_{it} - \lambda\bar{y}_i = (x_{it} - \lambda\bar{x}_i)\beta + (v_{it} - \lambda\bar{v}_i)$$

over all $t$ and $i$. The errors in this equation are serially uncorrelated and
homoskedastic under Assumption RE.3; therefore, they satisfy the key conditions for
pooled OLS analysis. The feasible RE estimator replaces the unknown $\lambda$ with its
estimator, $\hat{\lambda}$, so that $\hat{\beta}_{RE}$ can be computed from the pooled OLS regression

$$\check{y}_{it} \text{ on } \check{x}_{it}, \qquad t = 1, \ldots, T;\ i = 1, \ldots, N, \tag{10.79}$$

where now $\check{x}_{it} = x_{it} - \hat{\lambda}\bar{x}_i$ and $\check{y}_{it} = y_{it} - \hat{\lambda}\bar{y}_i$, all $t$ and $i$. Therefore, we can write

$$\hat{\beta}_{RE} = \left( \sum_{i=1}^{N} \sum_{t=1}^{T} \check{x}_{it}' \check{x}_{it} \right)^{-1} \left( \sum_{i=1}^{N} \sum_{t=1}^{T} \check{x}_{it}' \check{y}_{it} \right). \tag{10.80}$$

The usual variance estimate from the pooled OLS regression (10.79), SSR/$(NT - K)$,
is a consistent estimator of $\sigma_u^2$. The usual $t$ statistics and $F$ statistics from the pooled
regression are asymptotically valid under Assumptions RE.1-RE.3. For $F$ tests, we
obtain $\hat{\lambda}$ from the unrestricted model.

Equation (10.80) shows that the RE estimator is obtained by a quasi-time
demeaning: rather than removing the time average from the explanatory and dependent
variables at each $t$, RE estimation removes a fraction of the time average. If $\hat{\lambda}$ is close
to unity, the RE and FE estimates tend to be close. To see when this result occurs,
write $\hat{\lambda}$ as

$$\hat{\lambda} = 1 - \{1/[1 + T(\hat{\sigma}_c^2/\hat{\sigma}_u^2)]\}^{1/2}, \tag{10.81}$$

where $\hat{\sigma}_c^2$ and $\hat{\sigma}_u^2$ are consistent estimators of $\sigma_c^2$ and $\sigma_u^2$ (see Section 10.4). When
$T(\hat{\sigma}_c^2/\hat{\sigma}_u^2)$ is large, the second term in $\hat{\lambda}$ is small, in which case $\hat{\lambda}$ is close to unity. In
fact, $\hat{\lambda} \to 1$ as $T \to \infty$ or as $\hat{\sigma}_c^2/\hat{\sigma}_u^2 \to \infty$. For large $T$, it is not surprising to find
similar estimates from FE and RE. Even with small $T$, RE can be close to FE if the
estimated variance of $c_i$ is large relative to the estimated variance of $u_{it}$, a case often
relevant for applications. (As $\lambda$ approaches unity, the precision of the RE estimator
approaches that of the FE estimator, and the effects of time-constant explanatory
variables become harder to estimate.)
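
The quasi-time-demeaning characterization can be illustrated on simulated data. The sketch below (Python with NumPy) treats $\lambda$ as known, so it is not the feasible estimator; the data-generating process, the parameter values, and all function names are assumptions made for illustration. It checks that pooled OLS on the quasi-demeaned data reproduces the GLS estimator based on $\Omega$, and that setting $\lambda = 1$ gives the FE (within) slope.

```python
import numpy as np

def quasi_demean(a, lam):
    """Subtract lam times each unit's time average; a has shape (N, T) or (N, T, K)."""
    return a - lam * a.mean(axis=1, keepdims=True)

def pooled_ols(y, X):
    """Pooled OLS of y (N, T) on X (N, T, K), stacking all i and t."""
    K = X.shape[-1]
    coef, *_ = np.linalg.lstsq(X.reshape(-1, K), y.reshape(-1), rcond=None)
    return coef

def re_gls(y, X, sigma_u2, sigma_c2):
    """GLS using the random effects variance matrix with known variance components."""
    N, T, K = X.shape
    omega_inv = np.linalg.inv(sigma_u2 * np.eye(T) + sigma_c2 * np.ones((T, T)))
    A = sum(X[i].T @ omega_inv @ X[i] for i in range(N))
    b = sum(X[i].T @ omega_inv @ y[i] for i in range(N))
    return np.linalg.solve(A, b)

# Simulated balanced panel (assumed design): intercept plus one time-varying regressor.
rng = np.random.default_rng(0)
N, T = 200, 4
sigma_u2, sigma_c2 = 1.0, 2.0
w = rng.normal(size=(N, T))
c = np.sqrt(sigma_c2) * rng.normal(size=(N, 1))
u = np.sqrt(sigma_u2) * rng.normal(size=(N, T))
y = 1.0 + 0.5 * w + c + u
X = np.stack([np.ones((N, T)), w], axis=-1)

lam = 1.0 - np.sqrt(sigma_u2 / (sigma_u2 + T * sigma_c2))
beta_quasi = pooled_ols(quasi_demean(y, lam), quasi_demean(X, lam))
beta_gls = re_gls(y, X, sigma_u2, sigma_c2)

# lam = 1 removes the full time average, which is the within (FE) transformation.
beta_lam1 = pooled_ols(quasi_demean(y, 1.0), quasi_demean(X, 1.0))
wd, yd = quasi_demean(w, 1.0), quasi_demean(y, 1.0)
fe_slope = (wd * yd).sum() / (wd ** 2).sum()

print(np.allclose(beta_quasi, beta_gls), np.isclose(beta_lam1[1], fe_slope))
```

With $\lambda = 1$ the intercept column is demeaned to zero, so only the slope is identified; the least squares routine returns the minimum-norm solution, which leaves the slope equal to the within estimate.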


Example 10.7 (Job Training Grants): In Example 10.4, $T = 3$, $\hat{\sigma}_u^2 \approx .248$, and
$\hat{\sigma}_c^2 \approx 1.932$, which gives $\hat{\lambda} \approx .797$. This helps explain why the RE and FE estimates
are reasonably close.
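
The arithmetic in Example 10.7 follows directly from equation (10.81). A short Python sketch, where the function name `lambda_hat` is a hypothetical helper and the inputs are the estimates quoted in the example:

```python
import math

def lambda_hat(T, sigma_c2_hat, sigma_u2_hat):
    """Quasi-demeaning fraction from equation (10.81)."""
    return 1.0 - math.sqrt(1.0 / (1.0 + T * (sigma_c2_hat / sigma_u2_hat)))

# Values reported in Example 10.7.
lam = lambda_hat(T=3, sigma_c2_hat=1.932, sigma_u2_hat=0.248)
print(round(lam, 3))  # 0.797

# Holding the variance ratio fixed, lambda_hat rises toward one as T grows.
lam_by_T = [lambda_hat(T, 1.932, 0.248) for T in (3, 10, 100)]
```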


Equations (10.80) and (10.81) also show how RE and pooled OLS are related.
Pooled OLS is obtained by setting $\hat{\lambda} = 0$, which is never exactly true but could be
close. In practice, $\hat{\lambda}$ is not usually close to zero because small values require $\hat{\sigma}_u^2$ to be
large relative to $\hat{\sigma}_c^2$.

In Section 10.4 we emphasized that consistency of RE estimation hinges on the
orthogonality between $c_i$ and $x_{it}$. In fact, Assumption POLS.1 is weaker than
Assumption RE.1. We now see, because of the particular transformation used by the
RE estimator, that its inconsistency when Assumption RE.1b is violated can be small
relative to pooled OLS if $\sigma_c^2$ is large relative to $\sigma_u^2$ or if $T$ is large.

If we are primarily interested in the effect of a time-constant variable in a panel
data study, the robustness of the FE estimator to correlation between the unobserved
effect and the $x_{it}$ is practically useless. Without using an instrumental variables
approach (something we take up in Chapter 11), RE estimation is probably our
only choice. Sometimes, applications of the RE estimator attempt to control for the
part of $c_i$ correlated with $x_{it}$ by including dummy variables for various groups,
assuming that we have many observations within each group. For example, if we
have panel data on a group of working people, we might include city dummy
variables in a wage equation. Or, if we have panel data at the student level, we might
include school dummy variables. Including dummy variables for groups controls for a
certain amount of heterogeneity that might be correlated with the (time-constant)
elements of $x_{it}$. By using RE, we can efficiently account for any remaining serial
correlation due to unobserved time-constant factors. (Unfortunately, the language
used in empirical work can be confusing. It is not uncommon to see school dummy
variables referred to as “school fixed effects,” even though they appear in an RE
analysis at the individual level.)

Regression (10.79) using the quasi-time-demeaned data has several other practical
uses. Since it is just a pooled OLS regression that is asymptotically the same as using
$\lambda$ in place of $\hat{\lambda}$, we can easily obtain standard errors that are robust to arbitrary
heteroskedasticity in $c_i$ and $u_{it}$, as well as arbitrary serial correlation in the $\{u_{it}\}$. All that
is required is an econometrics package that computes robust standard errors, $t$, and $F$
statistics for pooled OLS regression, such as Stata. Further, we can use the residuals
from regression (10.79), say $\hat{r}_{it}$, to test for serial correlation in $r_{it} = v_{it} - \lambda\bar{v}_i$, which
are serially uncorrelated under Assumption RE.3a. If we detect serial correlation
in $\{r_{it}\}$, we conclude that Assumption RE.3a is false, which means that the $u_{it}$
are serially correlated. Although the arguments are tedious, it can be shown that
estimation of $\lambda$ and $\beta$ has no effect on the null limiting distribution of the usual (or
heteroskedasticity-robust) $t$ statistic from the pooled OLS regression $\hat{r}_{it}$ on $\hat{r}_{i,t-1}$,
$t = 2, \ldots, T$; $i = 1, \ldots, N$.


10.7.3 Hausman Test Comparing Random Effects and Fixed Effects Estimators 


Because the key consideration in choosing between an RE and an FE approach is
whether $c_i$ and $x_{it}$ are correlated, it is important to have a method for testing this
assumption. Hausman (1978) proposed a test based on the difference between the RE
and FE estimates. Since FE is consistent when $c_i$ and $x_{it}$ are correlated, but RE is
inconsistent, a statistically significant difference is interpreted as evidence against the
RE assumption RE.1b. This interpretation implicitly assumes that $\{x_{it}\}$ is strictly
exogenous with respect to $\{u_{it}\}$, that is, Assumption RE.1a (FE.1) holds.

Before we obtain the Hausman test, there are three caveats. First, strict exogeneity,
Assumption RE.1a, is maintained under the null and the alternative. Correlation
between $x_{is}$ and $u_{it}$ for any $s$ and $t$ causes both FE and RE to be inconsistent, and
generally their plims will differ. In order to view the Hausman test as a test of
$\mathrm{Cov}(x_{it}, c_i) = 0$, we must maintain Assumption RE.1a.

A second caveat is that the test is usually implemented assuming that Assumption 
RE.3 holds under the null. As we will see, this setup implies that the RE estimator is 
more efficient than the FE estimator, and it simplifies computation of the test statis- 
tic. But we must emphasize that Assumption RE.3 is an auxiliary assumption, and it 
is not being tested by the Hausman statistic: the Hausman test has no systematic 
power against the alternative that Assumption RE.1 is true but Assumption RE.3 is 
false. Failure of Assumption RE.3 causes the usual Hausman test to have a non- 
standard limiting distribution, an issue we return to in a later discussion. 

A third caveat concerns the set of parameters that we can compare. Because the
FE approach only identifies coefficients on time-varying explanatory variables, we
clearly cannot compare FE and RE coefficients on time-constant variables. But there
is a more subtle issue: we cannot include in our comparison coefficients on aggregate
time effects, that is, variables that change only across $t$. As with the case of
comparing FE and FD estimates, the problem with comparing coefficients on aggregate
time effects is not one of identification; we know RE and FE both allow inclusion of
a full set of time period dummies. The problem is one of singularity in the asymptotic
variance matrix of the difference between $\hat{\beta}_{FE}$ and $\hat{\beta}_{RE}$. Problem 10.17 asks you to
show that in the model $y_{it} = \alpha + d_t\eta + w_{it}\delta + c_i + u_{it}$, where $d_t$ is a $1 \times R$ vector of
aggregate time effects and $w_{it}$ is a $1 \times M$ vector of regressors varying across $i$ and $t$,
the asymptotic variance of the difference in the FE and RE estimators of $(\eta', \delta')'$ has
rank $M$, not $R + M$. (In fact, without $w_{it}$ in the model, the FE and RE estimates of $\eta$
are identical. Problem 10.18 asks you to investigate this algebraic result and some
related claims for a particular data set.)

Rather than consider the general case, which is complicated by the singularity
of the asymptotic covariance matrix, we assume that there are no aggregate time
effects in the model in deriving the traditional form of the Hausman test. Therefore,
we write the equation as $y_{it} = x_{it}\beta + c_i + u_{it} = z_i\gamma + w_{it}\delta + c_i + u_{it}$, where $z_i$ (say,
$1 \times J$) includes at least an intercept and often other time-constant variables. The
elements of the $1 \times M$ vector $w_{it}$ vary across $i$ and $t$ (as usual, at least for some units
and some time periods), and it is the FE and RE estimators of $\delta$ that we compare.
For deriving the traditional form of the test, we maintain Assumptions RE.1 to RE.3
under the null, as well as the rank condition for fixed effects, Assumption FE.2.

A key component of the traditional Hausman test is showing that the asymptotic 
variance of the FE estimator is never smaller, and is usually strictly larger, than the 
asymptotic variance of the RE estimator. We already know the asymptotic variance 
of the FE estimator: 


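$$\mathrm{Avar}(\hat{\delta}_{FE}) = \sigma_u^2 [E(\ddot{W}_i'\ddot{W}_i)]^{-1}/N. \tag{10.82}$$

In the presence of $z_i$ the derivation of $\mathrm{Avar}(\hat{\delta}_{RE})$ is a little more tedious, but
simplified by the pooled OLS characterization of RE using the quasi-time-demeaned
data. Define $\check{w}_{it} = w_{it} - \lambda\bar{w}_i$ as the quasi-time-demeaned time-varying regressors. To
get $\mathrm{Avar}(\hat{\delta}_{RE})$, we must obtain the population residuals from the pooled regression
$\check{w}_{it}$ on $(1 - \lambda)z_i$, which, of course, is the same as dropping the $(1 - \lambda)$. Call these
population residuals $\tilde{w}_{it}$. Then

$$\mathrm{Avar}(\hat{\delta}_{RE}) = \sigma_u^2 [E(\tilde{W}_i'\tilde{W}_i)]^{-1}/N. \tag{10.83}$$

Next, we want to show that $\mathrm{Avar}(\hat{\delta}_{FE}) - \mathrm{Avar}(\hat{\delta}_{RE})$ is positive definite. But this
holds if $[\mathrm{Avar}(\hat{\delta}_{RE})]^{-1} - [\mathrm{Avar}(\hat{\delta}_{FE})]^{-1}$ is positive definite, or $E(\tilde{W}_i'\tilde{W}_i) - E(\ddot{W}_i'\ddot{W}_i)$ is
positive definite. To demonstrate the latter result, we need to more carefully
characterize $\tilde{w}_{it} = \check{w}_{it} - z_i\Pi$, where $\Pi = [T \cdot E(z_i'z_i)]^{-1}E[z_i'(\sum_{t=1}^{T} \check{w}_{it})] = (1 - \lambda)[E(z_i'z_i)]^{-1}E(z_i'\bar{w}_i)$.
Straightforward algebra gives $\tilde{w}_{it} = \check{w}_{it} - (1 - \lambda)\bar{w}_i^*$, where $\bar{w}_i^* = L(\bar{w}_i \mid z_i)$ is
the linear projection of $\bar{w}_i$ on $z_i$. Because $\check{w}_{it} = \ddot{w}_{it} + (1 - \lambda)\bar{w}_i$, we can also write

$$\tilde{w}_{it} = \ddot{w}_{it} + (1 - \lambda)(\bar{w}_i - \bar{w}_i^*), \tag{10.84}$$

from which it follows immediately

$$
\begin{aligned}
E(\tilde{W}_i'\tilde{W}_i) - E(\ddot{W}_i'\ddot{W}_i) &= (1 - \lambda)^2\, T\, E[(\bar{w}_i - \bar{w}_i^*)'(\bar{w}_i - \bar{w}_i^*)] \\
&\quad + (1 - \lambda) E\Big[\sum_{t=1}^{T} \ddot{w}_{it}'(\bar{w}_i - \bar{w}_i^*)\Big] + (1 - \lambda) E\Big[(\bar{w}_i - \bar{w}_i^*)'\sum_{t=1}^{T} \ddot{w}_{it}\Big] \\
&= (1 - \lambda)^2\, T\, E[(\bar{w}_i - \bar{w}_i^*)'(\bar{w}_i - \bar{w}_i^*)],
\end{aligned}
\tag{10.85}
$$

because $\sum_{t=1}^{T} \ddot{w}_{it} = 0$ for all $i$. For $\lambda < 1$, and provided $\bar{w}_i - L(\bar{w}_i \mid z_i)$ has
variance with full rank (which means the time averages of the time-varying regressors
are not perfectly collinear with the time-constant regressors), we have shown that
$E(\tilde{W}_i'\tilde{W}_i) - E(\ddot{W}_i'\ddot{W}_i)$ is positive definite.

It can also be shown that, under RE.1 to RE.3, $\mathrm{Avar}(\hat{\delta}_{FE} - \hat{\delta}_{RE}) = \mathrm{Avar}(\hat{\delta}_{FE}) - \mathrm{Avar}(\hat{\delta}_{RE})$;
Newey and McFadden (1994, Sect. 5.3) provide general sufficient conditions,
which are met by the FE and RE estimators under Assumptions RE.1-RE.3.
(We cover these conditions in Chapter 14 in our discussion of general efficiency
issues; see Lemma 14.1 and the surrounding discussion.) Therefore, we can compute
the Hausman statistic as

$$H = (\hat{\delta}_{FE} - \hat{\delta}_{RE})'[\widehat{\mathrm{Avar}}(\hat{\delta}_{FE}) - \widehat{\mathrm{Avar}}(\hat{\delta}_{RE})]^{-1}(\hat{\delta}_{FE} - \hat{\delta}_{RE}), \tag{10.86}$$

which has a $\chi_M^2$ distribution under the null hypothesis. The usual estimates of
$\mathrm{Avar}(\hat{\delta}_{FE})$ and $\mathrm{Avar}(\hat{\delta}_{RE})$ can be used in equation (10.86), but if different estimates
of $\sigma_u^2$ are used, the matrix in the middle need not be positive definite, possibly leading
to a negative value of $H$. It is best to use the same estimate of $\sigma_u^2$ (based on either FE
or RE) in both places.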
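
The quadratic form in the Hausman statistic is simple to compute once the two coefficient vectors and their estimated asymptotic variance matrices are in hand. The following Python sketch uses NumPy; the function `hausman_statistic` is a hypothetical helper, and the numerical inputs are made up purely to exercise the code (they do not come from any data set in the text). The function refuses to return a value when the middle matrix is not positive definite, the situation that can arise when different error variance estimates are used for FE and RE.

```python
import numpy as np

def hausman_statistic(delta_fe, delta_re, avar_fe, avar_re):
    """Traditional (nonrobust) Hausman statistic and its degrees of freedom M.

    delta_fe, delta_re: length-M coefficient vectors on the time-varying regressors.
    avar_fe, avar_re:   M x M estimated asymptotic variance matrices.
    """
    diff = np.asarray(delta_fe, dtype=float) - np.asarray(delta_re, dtype=float)
    middle = np.asarray(avar_fe, dtype=float) - np.asarray(avar_re, dtype=float)
    # The difference of variance estimates must be positive definite for H to be
    # a valid (positive) quadratic form.
    if np.any(np.linalg.eigvalsh(middle) <= 0):
        raise ValueError("Avar(FE) - Avar(RE) is not positive definite")
    H = float(diff @ np.linalg.solve(middle, diff))
    return H, diff.size

# Hypothetical inputs, for illustration only.
d_fe = [0.50, -0.20]
d_re = [0.45, -0.15]
V_fe = [[0.010, 0.0010], [0.0010, 0.008]]
V_re = [[0.006, 0.0005], [0.0005, 0.005]]

H, M = hausman_statistic(d_fe, d_re, V_fe, V_re)
print(H, M)
```

The statistic would then be compared with a critical value from the chi-square distribution with `M` degrees of freedom.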

If we are interested in testing the difference in estimators for a single coefficient,
we can use a $t$ statistic form of the Hausman test. In particular, if $\delta$ now denotes a
scalar on the time-varying variable of interest, we can use
$(\hat{\delta}_{FE} - \hat{\delta}_{RE})/\{[\mathrm{se}(\hat{\delta}_{FE})]^2 - [\mathrm{se}(\hat{\delta}_{RE})]^2\}^{1/2}$, provided that we use versions of the
standard errors that ensure $\mathrm{se}(\hat{\delta}_{FE}) > \mathrm{se}(\hat{\delta}_{RE})$. Under Assumptions RE.1-RE.3, the
$t$ statistic has an asymptotic standard normal distribution.

So far we have stated that the null hypothesis is RE.1-RE.3, but expression (10.84)
allows us to characterize the implicit null hypothesis. From (10.84), it is seen that
deviations between the RE and FE estimates of $\delta$ are due to correlation between
$\bar{w}_i - \bar{w}_i^*$ and $c_i$ (where we maintain Assumption RE.1a, which is strict exogeneity
conditional on $c_i$). In other words, the Hausman test is a test of

$$H_0: E[(\bar{w}_i - \bar{w}_i^*)'c_i] = 0. \tag{10.87}$$

Equation (10.87) is interesting for several reasons. First, if there are no time-constant
variables (except an overall intercept) in the RE estimation, the null hypothesis is
$\mathrm{Cov}(\bar{w}_i, c_i) = 0$, which means we are really testing whether the time average of the $w_{it}$
is correlated with the unobserved effect. With time-constant explanatory variables $z_i$,
we first remove the correlation between $\bar{w}_i$ and $z_i$ to form the population residuals,
$\bar{w}_i - \bar{w}_i^*$, before testing for correlation with $c_i$. An immediate consequence is that,
with a rich set of controls in $z_i$, it is possible for $\bar{w}_i - \bar{w}_i^*$ to be uncorrelated with
$c_i$ even though $\bar{w}_i$ is correlated with $c_i$. Not surprisingly, an RE analysis with
good controls in $z_i$ and an RE analysis that omits such controls can yield very
different estimates of $\delta$, and the RE estimate of $\delta$ with $z_i$ included might be much closer
to the FE estimate than if $z_i$ is excluded. Often, discussions of the Hausman test
comparing FE and RE assume that only time-varying explanatory variables are
included in the RE estimation, but that is too restrictive for applications with
time-constant controls.

Equation (10.87) also suggests a simple regression-based approach to computing a
Hausman statistic. In fact, we are led to a regression-based method if we use a
particular correlated RE assumption due to Mundlak (1978): $c_i = \psi + \bar{w}_i\xi + a_i$, where $a_i$
has zero mean and is assumed to be uncorrelated with $w_i = (w_{i1}, \ldots, w_{iT})$ and $z_i$.
Plugging in for $c_i$ gives the expanded equation

$$y_{it} = x_{it}\beta + \bar{w}_i\xi + a_i + u_{it}, \tag{10.88}$$

where we absorb $\psi$ into $\beta$ because we assume $x_{it}$ includes an intercept. In fact, in
addition to an intercept, time-constant variables $z_i$, and $w_{it}$, $x_{it}$ can (and usually
should) contain a full set of time period dummies. Mundlak (1978) suggested testing
$H_0: \xi = 0$ to determine if the heterogeneity was correlated with the time averages of
the $w_{it}$. (Incidentally, the formulation in equation (10.88) makes it clear that we
cannot include the time average of aggregate time variables in $\bar{w}_i$ because the time
averages simply would be constant across all $i$.)

How should we estimate the parameters in equation (10.88)? We could estimate the
equation by pooled OLS or we could estimate the equation by random effects, which
is asymptotically efficient under RE.1-RE.3. As it turns out, the pooled OLS and RE
estimates of $\xi$ are identical, and $\hat{\xi} = \hat{\delta}_B - \hat{\delta}_{FE}$, where $\hat{\delta}_B$ (the between estimator) is the
coefficient vector on $\bar{w}_i$ from the cross section regression $\bar{y}_i$ on $z_i$, $\bar{w}_i$, $i = 1, \ldots, N$.
Further, the coefficient vector on $w_{it}$ is simply the FE estimate; see Mundlak (1978)
and Hsiao (2003, Sect. 3.2) for verification. In other words, the regression-based
version of the test explicitly compares the between estimate and the FE estimate of $\delta$.
Hausman and Taylor (1981) show that this is the same as basing the test on the RE
and FE estimate because the RE estimator is a matrix-weighted average of the
between and FE estimators; see also Baltagi (2001, Sect. 2.3).
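
The algebraic relationships among the pooled OLS estimates of the Mundlak equation, the FE estimate, and the between estimate can be checked on simulated data. The sketch below (Python with NumPy) covers only the pooled OLS version on a balanced panel; the data-generating process, coefficient values, and variable names are assumptions chosen for illustration.

```python
import numpy as np

def ols(y, X):
    """Least squares coefficients of y on the columns of X."""
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coef

# Simulated balanced panel (assumed design): one time-constant z_i and one
# time-varying w_it, with c_i correlated with the time average of w_it.
rng = np.random.default_rng(1)
N, T = 300, 5
z = rng.normal(size=(N, 1))
w = rng.normal(size=(N, T)) + 0.5 * z
w_bar = w.mean(axis=1, keepdims=True)
c = 0.8 * w_bar + rng.normal(size=(N, 1))
y = 1.0 + 0.3 * z + 0.5 * w + c + rng.normal(size=(N, T))
y_bar = y.mean(axis=1, keepdims=True)

# Pooled OLS of y_it on (1, z_i, w_it, w_bar_i): the Mundlak regression.
X_m = np.column_stack([
    np.ones(N * T),
    np.broadcast_to(z, (N, T)).ravel(),
    w.ravel(),
    np.broadcast_to(w_bar, (N, T)).ravel(),
])
b_m = ols(y.ravel(), X_m)
delta_mundlak, xi_hat = b_m[2], b_m[3]

# FE (within) estimate of the coefficient on w_it.
wd, yd = w - w_bar, y - y_bar
delta_fe = (wd * yd).sum() / (wd ** 2).sum()

# Between estimate: cross section regression of y_bar_i on (1, z_i, w_bar_i).
X_b = np.column_stack([np.ones(N), z.ravel(), w_bar.ravel()])
delta_b = ols(y_bar.ravel(), X_b)[2]

print(np.isclose(delta_mundlak, delta_fe), np.isclose(xi_hat, delta_b - delta_fe))
```

Both equalities hold exactly (up to floating-point error) because, in a balanced panel, the time-demeaned regressor is orthogonal in the sample to every time-constant variable.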

That the FE estimator of $\delta$, the coefficient on $w_{it}$, is obtained as the RE
estimator in equation (10.88) sheds further light on the source of efficiency of RE over
FE. In effect, the RE estimator of $\delta$ sets $\xi$ to zero. Because $\bar{w}_i$ is correlated with
$w_{it}$ (often highly correlated), dropping it from the equation when it is legitimate to
do so increases efficiency of the remaining parameter estimates by reducing
multicollinearity. Of course, inappropriately setting $\xi$ to zero results in inconsistent
estimation of $\delta$, and that is the danger of the RE approach.

If we use pooled OLS to test $H_0: \xi = 0$, the usual pooled OLS test statistic will be
inappropriate, at a minimum because of the serial correlation induced by $a_i$. As we
know, it is easy to obtain a fully robust variance matrix estimator, and therefore a
fully robust Wald statistic, for pooled OLS. Such a statistic will be fully robust to
violations of Assumption RE.3. Alternatively, we can estimate (10.88) by RE, and, if 
we maintain Assumption RE.3, we can use a standard Wald test computed using the 
usual RE variance matrix estimator. We also know it is easy to obtain a fully robust 
Wald statistic, that is, a statistic that does not maintain Assumption RE.3 under the 
null. The robust Wald statistic based on the RE estimator is necessarily asymptoti- 
cally equivalent to the robust Wald statistic for pooled OLS. 

Earlier we discussed how the traditional Hausman test maintains Assumption 
RE.3 under the null. Why should we make the Hausman test robust to violations of
Assumption RE.3? There is some confusion on this point in the methodological and
empirical literatures. Hausman (1978) originally presented his testing principle as 
applying to situations where one estimator is efficient under the null but inconsistent 
under the alternative—the RE estimator in this case—and the other estimator is 
consistent under the null and the alternative but inefficient under the null—the FE 
estimator. While this scenario simplifies calculation of the test statistic—see equa- 
tion (10.86)—it by no means is required for the test to make sense. In the current 
context, whether or not the covariates are correlated with $c_i$ (through the time
average) has nothing to do with conditional second moment assumptions on $c_i$ or
$u_i = (u_{i1}, \ldots, u_{iT})'$, that is, on whether Assumption RE.3 holds. In fact, including
Assumption RE.3 in the null hypothesis serves to mask the kinds of misspecification 
the traditional Hausman test can detect. Certainly the test has power against viola- 
tions of (10.87)—this is what it is intended to have power against. But the nonrobust 
Hausman statistic, whether computed in (10.86) or from (10.88)—they are asymp- 
totically equivalent under RE.3—has no systematic power for detecting violation of 
Assumption RE.3. Specifically, if (10.87) holds and the standard rank conditions 
hold for FE and RE, the statistic converges in distribution (rather than diverging) 
whether or not Assumption RE.3 holds. In other words, the test is inconsistent for 
testing RE.3. This is easily seen from (10.86). The quadratic form will converge in 
distribution to a quadratic form in a multivariate normal. If Assumption RE.3 holds, 
that quadratic form has a chi-square distribution with M degrees of freedom, but not 
otherwise. Without Assumption RE.3, using the $\chi_M^2$ distribution to obtain critical
values, or p-values, will result in a test that will be undersized or oversized under 
(10.87)—and we cannot generally tell which is the case. This feature carries over to 
any Hausman statistic where an auxiliary assumption is maintained that means one 
estimator is asymptotically efficient under the null. See Wooldridge (1991b) for fur- 
ther discussion in the context of testing conditional mean specifications. 

To summarize, we can estimate models that include aggregate time effects, time- 
constant variables, and regressors that change across both $i$ and $t$, by RE and FE
estimation. But no matter how we compute a test statistic, we can only compare the
coefficients on the regressors that change across both i and t. The regression-based 
version of the test in equation (10.88) makes it easy to obtain a statistic with a non- 
degenerate asymptotic distribution, and it is also easy to make the regression-based 
test fully robust to violations of Assumption RE.3. 

As in any other context that uses statistical inference, it is possible to get a statis- 
tical rejection of RE.1b (say, at the 5 percent level) with the differences between the 
RE and FE estimates being practically small. The opposite case is also possible: there 
can be seemingly large differences between the RE and FE estimates but, due to large 
standard errors, the Hausman statistic fails to reject. What should be done in this 
case? A typical response is to conclude that the random effects assumptions hold and 
to focus on the RE estimates. Unfortunately, we may be committing a Type II error:
failing to reject Assumption RE.1b when it is false. 


Problems 


10.1. Consider a model for new capital investment in a particular industry (say, 
manufacturing), where the cross section observations are at the county level and there 
are T years of data for each county: 


$$\log(\mathit{invest}_{it}) = \theta_t + z_{it}\gamma + \delta_1 \mathit{tax}_{it} + \delta_2 \mathit{disaster}_{it} + c_i + u_{it}.$$


The variable $\mathit{tax}_{it}$ is a measure of the marginal tax rate on capital in the county, and
$\mathit{disaster}_{it}$ is a dummy indicator equal to one if there was a significant natural disaster
in county $i$ at time period $t$ (for example, a major flood, a hurricane, or an
earthquake). The variables in $z_{it}$ are other factors affecting capital investment, and the $\theta_t$
represent different time intercepts.


a. Why is allowing for aggregate time effects in the equation important? 
b. What kinds of variables are captured in $c_i$?


c. Interpreting the equation in a causal fashion, what sign does economic reasoning
suggest for $\delta_1$?

d. Explain in detail how you would estimate this model; be specific about the
assumptions you are making.

e. Discuss whether strict exogeneity is reasonable for the two variables $\mathit{tax}_{it}$
and $\mathit{disaster}_{it}$; assume that neither of these variables has a lagged effect on capital
investment.


10.2. Suppose you have $T = 2$ years of data on the same group of $N$ working
individuals. Consider the following model of wage determination:

$$\log(\mathit{wage}_{it}) = \theta_1 + \theta_2 d2_t + z_{it}\gamma + \delta_1 \mathit{female}_i + \delta_2 d2_t \cdot \mathit{female}_i + c_i + u_{it}.$$


The unobserved effect $c_i$ is allowed to be correlated with $z_{it}$ and $\mathit{female}_i$. The variable
$d2_t$ is a time period indicator, where $d2_t = 1$ if $t = 2$ and $d2_t = 0$ if $t = 1$. In what
follows, assume that


$$E(u_{it} \mid \mathit{female}_i, z_{i1}, z_{i2}, c_i) = 0, \qquad t = 1, 2.$$

a. Without further assumptions, what parameters in the log wage equation can be
consistently estimated?

b. Interpret the coefficients $\theta_2$ and $\delta_2$.

c. Write the log wage equation explicitly for the two time periods. Show that the
differenced equation can be written as


$$\Delta\log(\mathit{wage}_i) = \theta_2 + \Delta z_i\gamma + \delta_2 \mathit{female}_i + \Delta u_i,$$


where $\Delta\log(\mathit{wage}_i) = \log(\mathit{wage}_{i2}) - \log(\mathit{wage}_{i1})$, and so on.


d. How would you test $H_0: \delta_2 = 0$ if $\mathrm{Var}(\Delta u_i \mid \Delta z_i, \mathit{female}_i)$ is not constant?

10.3. For $T = 2$ consider the standard unobserved effects model

$$y_{it} = x_{it}\beta + c_i + u_{it}, \qquad t = 1, 2.$$


Let $\hat{\beta}_{FE}$ and $\hat{\beta}_{FD}$ denote the fixed effects and first difference estimators, respectively.

a. Show that the FE and FD estimates are numerically identical. 


b. Show that the error variance estimates from the FE and FD methods are numer- 
ically identical. 


10.4. A common setup for program evaluation with two periods of panel data is the 
following. Let $y_{it}$ denote the outcome of interest for unit $i$ in period $t$. At $t = 1$, no
one is in the program; at $t = 2$, some units are in the control group and others are in
the experimental group. Let $\mathit{prog}_{it}$ be a binary indicator equal to one if unit $i$ is in the
program in period $t$; by the program design, $\mathit{prog}_{i1} = 0$ for all $i$. An unobserved
effects model without additional covariates is


$$y_{it} = \theta_1 + \theta_2 d2_t + \delta_1 \mathit{prog}_{it} + c_i + u_{it}, \qquad E(u_{it} \mid \mathit{prog}_{i2}, c_i) = 0,$$

where $d2_t$ is a dummy variable equal to unity if $t = 2$, and zero if $t = 1$, and $c_i$ is the
unobserved effect.


a. Explain why including d2, is important in these contexts. In particular, what 
problems might be caused by leaving it out? 


b. Why is it important to include c; in the equation? 




c. Using the first differencing method, show that $\hat{\theta}_2 = \Delta\bar{y}_{control}$ and
$\hat{\delta}_1 = \Delta\bar{y}_{treat} - \Delta\bar{y}_{control}$, where $\Delta\bar{y}_{control}$ is the average change in $y$ over the two periods for the group
with $\mathit{prog}_{i2} = 0$, and $\Delta\bar{y}_{treat}$ is the average change in $y$ for the group where $\mathit{prog}_{i2} = 1$.
This formula shows that $\hat{\delta}_1$, the difference-in-differences estimator, arises out of an
unobserved effects panel data model.


d. Write down the extension of the model for T time periods. 


e. A common way to obtain the DD estimator for two years of panel data is from 
the model 


$$y_{it} = \alpha_1 + \alpha_2 \mathit{start}_t + \alpha_3 \mathit{prog}_i + \delta_1 \mathit{start}_t \cdot \mathit{prog}_i + u_{it}, \tag{10.89}$$


where $E(u_{it} \mid \mathit{start}_t, \mathit{prog}_i) = 0$, $\mathit{prog}_i$ denotes whether unit $i$ is in the program in the
second period, and $\mathit{start}_t$ is a binary variable indicating when the program starts. In
the two-period setup, $\mathit{start}_t = d2_t$ and $\mathit{prog}_{it} = \mathit{start}_t \cdot \mathit{prog}_i$. The pooled OLS estimator
of $\delta_1$ is the DD estimator from part c. With $T > 2$, the unobserved effects model from
part d and pooled estimation of equation (10.89) no longer generally give the same
estimate of the program effect. Which approach do you prefer, and why?


10.5. Assume that Assumptions RE.1 and RE.3a hold, but $\mathrm{Var}(c_i \mid x_i) \neq \mathrm{Var}(c_i)$.


a. Describe the general nature of $E(v_i v_i' \mid x_i)$.


b. What are the asymptotic properties of the random effects estimator and the asso- 
ciated test statistics? 


c. How should the random effects statistics be modified? 


10.6. For a model where $x_{it}$ varies across $i$ and $t$, define the $K \times K$ symmetric
matrices $A_1 \equiv E(\Delta X_i'\Delta X_i)$ and $A_2 \equiv E(\ddot{X}_i'\ddot{X}_i)$, and assume both are positive definite.
Define $\hat{\theta} \equiv (\hat{\beta}_{FD}', \hat{\beta}_{FE}')'$ and $\theta \equiv (\beta', \beta')'$, both $2K \times 1$ vectors.

a. Under Assumption FE.1 (and the rank conditions we have given), find $\sqrt{N}(\hat{\theta} - \theta)$
in terms of $A_1$, $A_2$, $N^{-1/2}\sum_{i=1}^{N} \Delta X_i'\Delta u_i$, and $N^{-1/2}\sum_{i=1}^{N} \ddot{X}_i'\ddot{u}_i$ [with a $o_p(1)$ remainder].

b. Explain how to consistently estimate $\mathrm{Avar}\,\sqrt{N}(\hat{\theta} - \theta)$ without further assumptions.

c. Use parts a and b to obtain a robust Hausman statistic comparing the FD and FE
estimators. What is the limiting distribution of your statistic under $H_0$?

d. If $x_{it} = (d_t, w_{it})$, where $d_t$ is a $1 \times R$ vector of aggregate time variables, can you
compare all of the FD and FE estimates of $\beta$? Explain.


10.7. Use the two terms of data in GPA.RAW to estimate an unobserved effects
version of the model in Example 7.8. You should drop the variable cumgpa (since this
variable violates strict exogeneity).

a. Estimate the model by RE, and interpret the coefficient on the in-season variable.

b. Estimate the model by FE; informally compare the estimates to the RE estimates,
in particular that on the in-season effect.


c. Construct the nonrobust Hausman test comparing RE and FE. Include all
variables in $w_{it}$ that have some variation across $i$ and $t$, except for the term dummy.


10.8. Use the data in NORWAY.RAW for the years 1972 and 1978 for a two-year 
panel data analysis. The model is a simple distributed lag model: 


$$\log(\mathit{crime}_{it}) = \theta_0 + \theta_1 d78_t + \beta_1 \mathit{clrprc}_{i,t-1} + \beta_2 \mathit{clrprc}_{i,t-2} + c_i + u_{it}.$$


The variable clrprc is the clear-up percentage (the percentage of crimes solved).
The data are stored for two years, with the needed lags given as variables for each
year.

a. First estimate this equation using a pooled OLS analysis. Comment on the
deterrent effect of the clear-up percentage, including interpreting the size of the
coefficients. Test for serial correlation in the composite error $v_{it}$ assuming strict exogeneity
(see Section 7.8).


b. Estimate the equation by FE, and compare the estimates with the pooled OLS 
estimates. Is there any reason to test for serial correlation? Obtain heteroskedasticity- 
robust standard errors for the FE estimates. 


c. Using FE analysis, test the hypothesis $H_0: \beta_1 = \beta_2$. What do you conclude? If the
hypothesis is not rejected, what would be a more parsimonious model? Estimate this 
model. 

10.9. Use the data in CORNWELL.RAW for this problem. 


a. Estimate both an RE and an FE version of the model in Problem 7.11a. Compute 
the regression-based version of the Hausman test comparing RE and FE. 


b. Add the wage variables (in logarithmic form), and test for joint significance after 
estimation by FE. 


c. Estimate the equation by FD, and comment on any notable changes. Do the 
standard errors change much between FE and FD? 


d. Test the FD equation for AR(1) serial correlation. 


10.10. An unobserved effects model explaining current murder rates in terms of the 
number of executions in the last three years is 


$$\mathit{mrdrte}_{it} = \theta_t + \beta_1 \mathit{exec}_{it} + \beta_2 \mathit{unem}_{it} + c_i + u_{it},$$




where $\mathit{mrdrte}_{it}$ is the number of murders in state $i$ during year $t$, per 10,000 people;
$\mathit{exec}_{it}$ is the total number of executions for the current and prior two years; and
$\mathit{unem}_{it}$ is the current unemployment rate, included as a control.

a. Using the data in MURDER.RAW, estimate this model by FD. Notice that you 
should allow different year intercepts. Test the errors in the FD equation for serial 
correlation. 

b. Estimate the model by FE. Are there any important differences from the FD 
estimates? 


c. Under what circumstances would $\mathit{exec}_{it}$ not be strictly exogenous (conditional on
$c_i$)?


10.11. Use the data in LOWBIRTH.RAW for this question. 
a. For 1987 and 1990, consider the state-level equation 


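$$
\begin{aligned}
\mathit{lowbrth}_{it} = {} & \theta_1 + \theta_2 d90_t + \beta_1 \mathit{afdcprc}_{it} + \beta_2 \log(\mathit{phypc}_{it}) \\
& + \beta_3 \log(\mathit{bedspc}_{it}) + \beta_4 \log(\mathit{pcinc}_{it}) + \beta_5 \log(\mathit{popul}_{it}) + c_i + u_{it},
\end{aligned}
$$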


where the dependent variable is percentage of births that are classified as low birth 
weight and the key explanatory variable is afdcprc, the percentage of the population 
in the welfare program, Aid to Families with Dependent Children (AFDC). The 
other variables, which act as controls for quality of health care and income levels, are 
physicians per capita, hospital beds per capita, per capita income, and population. 
Interpreting the equation causally, what sign should each $\beta_j$ have? (Note:
Participation in AFDC makes poor women eligible for nutritional programs and prenatal
care.)


b. Estimate the preceding equation by pooled OLS, and discuss the results. You
should report the usual standard errors and serial correlation-robust standard errors.


c. Difference the equation to eliminate the state FE, $c_i$, and reestimate the equation.
Interpret the estimate of $\beta_1$ and compare it to the estimate from part b. What do you
make of $\hat{\beta}_2$?


d. Add $\mathit{afdcprc}^2$ to the model, and estimate it by FD. Are the estimates on $\mathit{afdcprc}$
and $\mathit{afdcprc}^2$ sensible? What is the estimated turning point in the quadratic?


10.12. The data in WAGEPAN.RAW are from Vella and Verbeek (1998) for 545 
men who worked every year from 1980 to 1987. Consider the wage equation 


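$$
\begin{aligned}
\log(\mathit{wage}_{it}) = {} & \theta_t + \beta_1 \mathit{educ}_i + \beta_2 \mathit{black}_i + \beta_3 \mathit{hisp}_i + \beta_4 \mathit{exper}_{it} \\
& + \beta_5 \mathit{exper}_{it}^2 + \beta_6 \mathit{married}_{it} + \beta_7 \mathit{union}_{it} + c_i + u_{it}.
\end{aligned}
$$

The variables are described in the data set. Notice that education does not change
over time.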


a. Estimate this equation by pooled OLS, and report the results in standard form. 
Are the usual OLS standard errors reliable, even if $c_i$ is uncorrelated with all
explanatory variables? Explain. Compute appropriate standard errors.


b. Estimate the wage equation by RE. Compare your estimates with the pooled OLS 
estimates. 


c. Now estimate the equation by FE. Why is $\mathit{exper}_{it}$ redundant in the model even
though it changes over time? What happens to the marriage and union premiums as 
compared with the RE estimates? 


d. Now add interactions of the form $d81 \cdot \mathit{educ}$, $d82 \cdot \mathit{educ}$, ..., $d87 \cdot \mathit{educ}$ and estimate
the equation by FE. Has the return to education increased over time? 


e. Return to the original model estimated by FE in part c. Add a lead of the union 
variable, $\mathit{union}_{i,t+1}$, to the equation, and estimate the model by FE (note that you lose
the data for 1987). Is $\mathit{union}_{i,t+1}$ significant? What does your finding say about strict
exogeneity of union membership? 


f. Return to the original model, but add the interactions $\mathit{black}_i \cdot \mathit{union}_{it}$ and
$\mathit{hisp}_i \cdot \mathit{union}_{it}$. Do the union wage premiums differ by race? Obtain the usual FE
statistics and those fully robust to heteroskedasticity and serial correlation.

g. Add $\mathit{union}_{i,t+1}$ to the equation from part f, and obtain a fully robust test of the
hypothesis that $\{\mathit{union}_{it} : t = 1, \ldots, T\}$ is strictly exogenous. What do you conclude?


10.13. Consider the standard linear unobserved effects model (10.11), under the 
assumptions 


$$E(u_{it} \mid \mathbf{x}_i, \mathbf{h}_i, c_i) = 0, \quad \mathrm{Var}(u_{it} \mid \mathbf{x}_i, \mathbf{h}_i, c_i) = \sigma_u^2 h_{it}, \quad t = 1, \ldots, T,$$

where $\mathbf{h}_i \equiv (h_{i1}, \ldots, h_{iT})$. In other words, the errors display heteroskedasticity that
depends on $h_{it}$. (In the leading case, $h_{it}$ is a function of $\mathbf{x}_{it}$.) Suppose you estimate $\boldsymbol{\beta}$
by minimizing the weighted sum of squared residuals

$$\sum_{i=1}^{N} \sum_{t=1}^{T} (y_{it} - a_1 d1_i - \cdots - a_N dN_i - \mathbf{x}_{it}\mathbf{b})^2 / h_{it}$$

with respect to the $a_i$, $i = 1, \ldots, N$ and $\mathbf{b}$, where $dn_i = 1$ if $i = n$. (This would seem to
be the natural analogue of the dummy variable regression, modified for known
heteroskedasticity. We might call this a fixed effects weighted least squares estimator.)

a. Show that the FEWLS estimator is generally consistent for fixed $T$ as $N \to \infty$.
Do you need the variance function to be correctly specified in the sense that
$\mathrm{Var}(u_{it} \mid \mathbf{x}_i, \mathbf{h}_i, c_i) = \sigma_u^2 h_{it}$, $t = 1, \ldots, T$? Explain.

b. Suppose the variance function is correctly specified and $\mathrm{Cov}(u_{it}, u_{is} \mid \mathbf{x}_i, \mathbf{h}_i, c_i) = 0$,
$t \neq s$. Find the asymptotic variance of $\sqrt{N}(\hat{\boldsymbol{\beta}}_{FEWLS} - \boldsymbol{\beta})$.


c. Under the assumptions of part b, how would you estimate $\sigma_u^2$ and
$\mathrm{Avar}[\sqrt{N}(\hat{\boldsymbol{\beta}}_{FEWLS} - \boldsymbol{\beta})]$?

d. If the variance function is misspecified, or there is serial correlation in $u_{it}$, or both,
how would you estimate $\mathrm{Avar}[\sqrt{N}(\hat{\boldsymbol{\beta}}_{FEWLS} - \boldsymbol{\beta})]$?


10.14. Suppose that we have the unobserved effects model 
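$$y_{it} = \alpha + \mathbf{x}_{it}\boldsymbol{\beta} + \mathbf{z}_i\boldsymbol{\gamma} + h_i + u_{it},$$

where the $\mathbf{x}_{it}$ ($1 \times K$) are time-varying, the $\mathbf{z}_i$ ($1 \times M$) are time-constant,
$E(u_{it} \mid \mathbf{x}_i, \mathbf{z}_i, h_i) = 0$, $t = 1, \ldots, T$, and $E(h_i \mid \mathbf{x}_i, \mathbf{z}_i) = 0$. Let $\sigma_h^2 = \mathrm{Var}(h_i)$ and $\sigma_u^2 =
\mathrm{Var}(u_{it})$. If we estimate $\boldsymbol{\beta}$ by fixed effects, we are estimating the equation
$y_{it} = \mathbf{x}_{it}\boldsymbol{\beta} + c_i + u_{it}$, where $c_i = \alpha + \mathbf{z}_i\boldsymbol{\gamma} + h_i$.

a. Find $\sigma_c^2 \equiv \mathrm{Var}(c_i)$. Show that $\sigma_c^2$ is at least as large as $\sigma_h^2$, and usually strictly
larger.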


b. Explain why estimation of the model by fixed effects will lead to a larger estimated 
variance of the unobserved effect than if we estimate the model by random effects. 
Does this result make intuitive sense? 


c. If $\lambda_c$ is the quasi-time-demeaning parameter without $\mathbf{z}_i$ in the model and $\lambda_h$ is the
quasi-time-demeaning parameter with $\mathbf{z}_i$ in the model, show that $\lambda_c \geq \lambda_h$, with strict
inequality if $\boldsymbol{\gamma} \neq \mathbf{0}$.

d. What does part c imply about using pooled OLS versus FE as the first step esti- 
mator for estimating the variance of the unobserved heterogeneity in RE estimation? 


e. Suppose that, in addition to RE.1–RE.3 holding in the original model,
$E(\mathbf{z}_i \mid \mathbf{x}_i) = \mathbf{0}$, $t = 1, \ldots, T$ and $\mathrm{Var}(\mathbf{z}_i \mid \mathbf{x}_i) = \mathrm{Var}(\mathbf{z}_i)$. Show directly (that is, by
comparing the two asymptotic variances) that the RE estimator that includes $\mathbf{z}_i$ is
asymptotically more efficient than the RE estimator that excludes $\mathbf{z}_i$. (The result also
follows from Problem 7.15 without the assumption $\mathrm{Var}(c_i \mid \mathbf{x}_i) = \mathrm{Var}(c_i)$; in fact, we
only need to assume $\mathbf{z}_i$ is uncorrelated with $\mathbf{x}_i$.)


10.15. Consider the standard unobserved effects model, first under a stronger version
of the RE assumptions. Let $v_{it} = c_i + u_{it}$, $t = 1, \ldots, T$, be the composite errors,
as usual. Then, in addition to RE.1–RE.3, assume that the $T \times 1$ vector $\mathbf{v}_i$ is
independent of $\mathbf{x}_i$ and that all conditional expectations involving $\mathbf{v}_i$ are linear.

a. Let $\bar{v}_i$ be the time average of the $v_{it}$. Argue that $E(v_{it} \mid \mathbf{x}_i, \bar{v}_i) = E(v_{it} \mid \bar{v}_i) = \bar{v}_i$.
(Hint: To show the second equality, recall that, because the expectation is assumed to
be linear, the coefficient on $\bar{v}_i$ is $\mathrm{Cov}(v_{it}, \bar{v}_i)/\mathrm{Var}(\bar{v}_i)$.)


b. Use part a to show that
$$E(y_{it} \mid \mathbf{x}_i, \bar{y}_i) = \mathbf{x}_{it}\boldsymbol{\beta} + (\bar{y}_i - \bar{\mathbf{x}}_i\boldsymbol{\beta}).$$


c. Argue that estimation of f based on part b leads to the FE estimator. 

d. How can you reconcile part c with the fact that the full RE assumptions have
been assumed? (Hint: What is the conditional expectation underlying RE estima- 
tion?) 

e. Show that the arguments leading up to part c carry over using linear projections 
under RE.1–RE.3 alone.


10.16. Assume that through time periods T + 1, the assumptions in Problem 10.15 
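hold. Suppose you want to forecast $y_{i,T+1}$ at time $T$, where you know $\mathbf{x}_{i,T+1}$ in
addition to all past values on $\mathbf{x}$ and $y$.

a. It can be shown, under the given assumptions, that $E(v_{i,T+1} \mid \mathbf{x}_i, \mathbf{x}_{i,T+1}, v_{i1}, \ldots, v_{iT})$
$= E(v_{i,T+1} \mid \bar{v}_i)$. Use this to show that $E(v_{i,T+1} \mid \bar{v}_i) = [\sigma_c^2/(\sigma_c^2 + \sigma_u^2/T)]\,\bar{v}_i$.


b. Use part a to derive $E(y_{i,T+1} \mid \mathbf{x}_{i1}, \ldots, \mathbf{x}_{iT}, \mathbf{x}_{i,T+1}, y_{i1}, \ldots, y_{iT})$.


c. Compare the expectation in part b with $E(y_{i,T+1} \mid \mathbf{x}_{i1}, \ldots, \mathbf{x}_{iT}, \mathbf{x}_{i,T+1})$.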


d. Which of the two expectations in parts b and c leads to a better forecast as defined 
by smallest variance of the forecast error? 


e. How would you forecast $y_{i,T+1}$ if you have to estimate all unknown population
parameters? 


10.17. Consider a standard unobserved effects model but where we explicitly separate
out aggregate time effects, say $\mathbf{d}_t$, a $1 \times R$ vector, where $R \leq T - 1$. (These are
usually a full set of time period dummies, but they could be other aggregate time 
variables, such as specific functions of time.) Therefore, the model is 


$$y_{it} = \alpha + \mathbf{d}_t\boldsymbol{\eta} + \mathbf{w}_{it}\boldsymbol{\delta} + c_i + u_{it}, \quad t = 1, \ldots, T,$$

where $\mathbf{w}_{it}$ is the $1 \times M$ vector of explanatory variables that vary across $i$ and $t$.
Because the $\mathbf{d}_t$ do not change across $i$, we take them to be nonrandom. Because
we have included an intercept in the model, we can assume that $E(c_i) = 0$. Let
$\lambda = 1 - \{1/[1 + T(\sigma_c^2/\sigma_u^2)]\}^{1/2}$ be the usual quasi-time-demeaning parameter for RE
estimation. In what follows, we take $\lambda$ as known because estimating it does not affect
the asymptotic distribution results. 


a. Show that we can write the quasi-time-demeaned equation for RE estimation as 


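$$y_{it} - \lambda\bar{y}_i = \mu + (\mathbf{d}_t - \bar{\mathbf{d}})\boldsymbol{\eta} + (\mathbf{w}_{it} - \lambda\bar{\mathbf{w}}_i)\boldsymbol{\delta} + (v_{it} - \lambda\bar{v}_i),$$

where $\mu = (1 - \lambda)\alpha + (1 - \lambda)\bar{\mathbf{d}}\boldsymbol{\eta}$, $v_{it} = c_i + u_{it}$, and $\bar{\mathbf{d}} = T^{-1}\sum_{t=1}^{T} \mathbf{d}_t$ is nonrandom.

b. To simplify the algebra without changing the substance of the findings, assume
that $\mu = 0$ and that we exclude an intercept in estimating the quasi-time-demeaned
equation. Write $\mathbf{g}_{it} = (\mathbf{d}_t - \bar{\mathbf{d}}, \mathbf{w}_{it} - \lambda\bar{\mathbf{w}}_i)$ and $\boldsymbol{\beta} = (\boldsymbol{\eta}', \boldsymbol{\delta}')'$. We will study the
asymptotic distribution of the RE estimator by using the pooled OLS estimator from
$y_{it} - \lambda\bar{y}_i$ on $\mathbf{g}_{it}$, $t = 1, \ldots, T$; $i = 1, \ldots, N$. Show that under Assumptions RE.1 and
RE.2,

$$\sqrt{N}(\hat{\boldsymbol{\beta}}_{RE} - \boldsymbol{\beta}) = \mathbf{A}_1^{-1} N^{-1/2} \sum_{i=1}^{N} \sum_{t=1}^{T} \mathbf{g}_{it}'(v_{it} - \lambda\bar{v}_i) + o_p(1),$$

where $\mathbf{A}_1 \equiv \sum_{t=1}^{T} E(\mathbf{g}_{it}'\mathbf{g}_{it})$. Further, verify that for any $i$,

$$\sum_{t=1}^{T} (\mathbf{d}_t - \bar{\mathbf{d}})(v_{it} - \lambda\bar{v}_i) = \sum_{t=1}^{T} (\mathbf{d}_t - \bar{\mathbf{d}})\, u_{it}.$$

c. Show that under FE.1 and FE.2,

$$\sqrt{N}(\hat{\boldsymbol{\beta}}_{FE} - \boldsymbol{\beta}) = \mathbf{A}_2^{-1} N^{-1/2} \sum_{i=1}^{N} \sum_{t=1}^{T} \mathbf{h}_{it}' u_{it} + o_p(1),$$

where $\mathbf{h}_{it} \equiv (\mathbf{d}_t - \bar{\mathbf{d}}, \mathbf{w}_{it} - \bar{\mathbf{w}}_i)$ and $\mathbf{A}_2 \equiv \sum_{t=1}^{T} E(\mathbf{h}_{it}'\mathbf{h}_{it})$.

d. Under RE.1, FE.1, and FE.2, show that $\mathbf{A}_1\sqrt{N}(\hat{\boldsymbol{\beta}}_{RE} - \boldsymbol{\beta}) - \mathbf{A}_2\sqrt{N}(\hat{\boldsymbol{\beta}}_{FE} - \boldsymbol{\beta})$ has
an asymptotic variance matrix of rank $M$ rather than $R + M$.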


e. What implications does part d have for a Hausman test that compares FE and 
RE when the model contains aggregate time variables of any sort? Does it matter 
whether Assumption RE.3 holds under the null? 


10.18. Use the data in WAGEPAN.RAW to answer this question. 


a. Using $\mathit{lwage}$ as the dependent variable, estimate a model that contains an intercept
and the year dummies $d81$ through $d87$. Use pooled OLS, RE, FE, and FD (where
in the latter case you difference the year dummies, along with $\mathit{lwage}$, and omit an
overall constant in the FD regression). What do you conclude about the coefficients 
on the dummy variables? 

b. Add the time-constant variables educ, black, and hisp to the model, and estimate it 
by POLS and RE. How do the coefficients compare? What happens if you estimate 
the equation by FE? 

c. Are the POLS and RE standard errors from part b the same? Which ones are 
probably more reliable? 

d. Obtain the fully robust standard errors for POLS. Do you prefer these or the usual 
RE standard errors? 

e. Obtain the fully robust RE standard errors. How do these compare with the fully 
robust POLS standard errors, and why? 


11 More Topics in Linear Unobserved Effects Models


This chapter continues our treatment of linear, unobserved effects panel data models. 
In Section 11.1 we briefly treat the GMM approach to estimating the standard, 
additive effect model from Chapter 10, emphasizing some equivalences between the 
standard estimators and GMM 3SLS estimators. In Section 11.2, we cover estima- 
tion of models where, at a minimum, the assumption of strict exogeneity conditional 
on the unobserved heterogeneity (Assumption FE.1) fails. Instead, we assume we 
have available instrumental variables (IVs) that are uncorrelated with the idiosyn- 
cratic errors in all time periods. Depending on whether these instruments are also 
uncorrelated with the unobserved effect, we are led to random effects or fixed effects 
IV methods. Section 11.3 shows how these methods apply to Hausman and Taylor 
(1981) models, where a subset of explanatory variables is allowed to be endogenous 
but enough explanatory variables are exogenous so that IV methods can be applied. 

Section 11.4 combines first differencing with IV methods. In Section 11.5 we study 
the properties of fixed effects and first differencing estimators in the presence of 
measurement error, and propose some IV solutions. We explicitly cover unobserved 
effects models with sequentially exogenous explanatory variables, including models 
with lagged dependent variables, in Section 11.6. In Section 11.7, we turn to models 
with unit-specific slopes, including the important special case of unit-specific time 
trends. 


11.1 Generalized Method of Moments Approaches to the Standard Linear 
Unobserved Effects Model 


11.1.1 Equivalence between GMM 3SLS and Standard Estimators

In Chapter 10, we extensively covered estimation of the unobserved effects model 
$$y_{it} = \mathbf{x}_{it}\boldsymbol{\beta} + c_i + u_{it}, \quad t = 1, 2, \ldots, T, \tag{11.1}$$

which we can write for all $T$ time periods as

$$\mathbf{y}_i = \mathbf{X}_i\boldsymbol{\beta} + c_i\mathbf{j}_T + \mathbf{u}_i = \mathbf{X}_i\boldsymbol{\beta} + \mathbf{v}_i, \tag{11.2}$$

where $\mathbf{j}_T$ is the $T \times 1$ vector of ones, $\mathbf{u}_i$ is the $T \times 1$ vector of idiosyncratic errors, and
$\mathbf{v}_i = c_i\mathbf{j}_T + \mathbf{u}_i$ is the $T \times 1$ vector of composite errors. Random effects (RE), fixed
effects (FE), and first differencing (FD) are still the most popular approaches to
estimating $\boldsymbol{\beta}$ in equation (11.1) with strictly exogenous explanatory variables. As we
saw in Chapter 10, each of these estimators is consistent without restrictions on the
variance-covariance matrix of the composite ($\mathbf{v}_i$) or idiosyncratic ($\mathbf{u}_i$) errors. We also
saw that each estimator is asymptotically efficient under a particular set of
assumptions on the conditional second moment matrix of $\mathbf{v}_i$ in the RE case, $\mathbf{u}_i$ in the FE
case, and $\Delta\mathbf{u}_i$ in the FD case. Recall that there are two aspects to the second moment
assumptions that imply asymptotic efficiency. First, the unconditional
variance-covariance matrix is assumed to have a special structure. Second, system
homoskedasticity is assumed in all cases. (Loosely, conditional variances and covariances
do not depend on the explanatory variables.)

We have already seen how to allow for an unrestricted unconditional variance- 
covariance matrix in the context of RE, FE, and FD: we simply apply unrestricted 
FGLS to the appropriately transformed equations. Nevertheless, efficiency of, 
say, the FEGLS estimator hinges on the system homoskedasticity assumption 
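$E(\mathbf{u}_i\mathbf{u}_i' \mid \mathbf{x}_i, c_i) = E(\mathbf{u}_i\mathbf{u}_i')$. Under RE.1, efficiency of the FGLS estimator with
unrestricted $\mathrm{Var}(\mathbf{v}_i)$ is ensured only if $E(\mathbf{v}_i\mathbf{v}_i' \mid \mathbf{x}_i) = E(\mathbf{v}_i\mathbf{v}_i')$. If this assumption fails,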
GMM with an optimal weighting matrix is generally more efficient, and so it may be 
worthwhile to apply GMM methods to (11.2) (or, in the case of FE, a suitably 
transformed set of equations.) 

We first suppose that Assumption RE.1 holds, so that $\mathbf{x}_{is}$ is uncorrelated with
the composite error $v_{it}$ for all $s$ and $t$. (In fact, the zero conditional mean assumptions
in Assumption RE.1 imply that $E(v_{it} \mid \mathbf{x}_i) = 0$, and so any function of $\mathbf{x}_i \equiv
(\mathbf{x}_{i1}, \mathbf{x}_{i2}, \ldots, \mathbf{x}_{iT})$ is uncorrelated with $v_{it}$. Here, we limit ourselves to linear
functions.)

Let $\mathbf{x}_i^o$ denote the row vector of nonredundant elements of $\mathbf{x}_i$, so that any
time-constant elements and aggregate time effects appear only once in $\mathbf{x}_i^o$. Then $E(\mathbf{x}_i^{o\prime} v_{it}) =
\mathbf{0}$, $t = 1, 2, \ldots, T$. This orthogonality condition suggests a system IV procedure, with
matrix of instruments

$$\mathbf{Z}_i \equiv \mathbf{I}_T \otimes \mathbf{x}_i^o. \tag{11.3}$$

In other words, use instruments $\mathbf{Z}_i$ to estimate equation (11.2) by 3SLS or, more
generally, by minimum chi-square. 

The matrix (11.3) can contain many instruments. If $\mathbf{x}_{it}$ contains only variables that
change across both $i$ and $t$, then $\mathbf{Z}_i$ is $T \times T^2K$. With only $K$ parameters to estimate,
this choice of instruments implies many overidentifying restrictions even for mod- 
erately sized T. Even if computation is not an issue, using many overidentifying 
restrictions can result in poor finite sample properties. 
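
To make the size of the instrument matrix (11.3) concrete, here is a minimal NumPy sketch. It assumes that every column of the regressor matrix varies over both $i$ and $t$, so that $\mathbf{x}_i^o$ is just the $T$ rows laid end to end; the function name and the dimensions are illustrative and not part of the text.

```python
import numpy as np

def system_iv_instruments(X_i: np.ndarray) -> np.ndarray:
    """Return Z_i = I_T kron x_i^o for one cross-sectional unit.

    X_i is the T x K regressor matrix. Every column is assumed to vary
    over both i and t, so x_i^o stacks the T rows into one 1 x TK vector.
    """
    T, K = X_i.shape
    x_o = X_i.reshape(1, T * K)           # (x_i1, x_i2, ..., x_iT)
    return np.kron(np.eye(T), x_o)        # T x (T * T * K)

rng = np.random.default_rng(0)
T, K = 4, 3                               # hypothetical dimensions
X_i = rng.normal(size=(T, K))
Z_i = system_iv_instruments(X_i)
n_overid = Z_i.shape[1] - K               # moment conditions minus parameters
```

Even with these small dimensions the matrix has $T^2K = 48$ columns for $K = 3$ parameters, which is the proliferation of overidentifying restrictions the paragraph above warns about.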

In some cases, we can reduce the number of moment conditions without sacrificing 
efficiency. Im, Ahn, Schmidt, and Wooldridge (1999; IASW) show the following
result. If $\hat{\boldsymbol{\Omega}}$ has the random effects structure (which means we impose the RE structure
in estimating $\boldsymbol{\Omega}$), then GMM 3SLS applied to equation (11.2), using instruments

$$\mathbf{Z}_i \equiv (\mathbf{P}_T\mathbf{X}_i, \mathbf{Q}_T\mathbf{W}_i), \tag{11.4}$$

where $\mathbf{P}_T = \mathbf{j}_T(\mathbf{j}_T'\mathbf{j}_T)^{-1}\mathbf{j}_T'$, $\mathbf{Q}_T \equiv \mathbf{I}_T - \mathbf{P}_T$, $\mathbf{j}_T \equiv (1, 1, \ldots, 1)'$, and $\mathbf{W}_i$ is the $T \times M$
submatrix of $\mathbf{X}_i$ obtained by removing the time-constant variables, is identical to the
RE estimator. The column dimension of matrix (11.4) is only K + M, so there are 
only M overidentifying restrictions in using the 3SLS estimator. 
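
The algebraic equivalence can be checked numerically. The sketch below is an illustration under stated assumptions, not the IASW proof: it treats the variance components as known (so $\boldsymbol{\Omega}$ has the RE form exactly), simulates a small panel with an intercept, one time-constant regressor, and two time-varying regressors, and compares GMM 3SLS using instruments (11.4) with GLS using the same $\boldsymbol{\Omega}$. All numerical values are made up.

```python
import numpy as np

rng = np.random.default_rng(1)
N, T, M = 200, 4, 2                       # hypothetical panel dimensions
sigma_c2, sigma_u2 = 1.5, 0.7             # assumed known variance components

j = np.ones((T, 1))
P = j @ j.T / T                           # P_T = j (j'j)^{-1} j'
Q = np.eye(T) - P                         # Q_T = I_T - P_T
Omega = sigma_u2 * np.eye(T) + sigma_c2 * (j @ j.T)   # RE structure
Omega_inv = np.linalg.inv(Omega)

beta = np.array([1.0, 0.5, -2.0, 0.3])    # intercept, time-constant, two time-varying
K = beta.size

XZ = np.zeros((K, K + M)); ZOZ = np.zeros((K + M, K + M)); Zy = np.zeros(K + M)
XOX = np.zeros((K, K)); XOy = np.zeros(K)
for _ in range(N):
    W = rng.normal(size=(T, M))           # time-varying regressors
    g = rng.normal()                      # one time-constant regressor
    X = np.column_stack([np.ones(T), np.full(T, g), W])
    c = rng.normal(scale=np.sqrt(sigma_c2))
    u = rng.normal(scale=np.sqrt(sigma_u2), size=T)
    y = X @ beta + c + u
    Z = np.column_stack([P @ X, Q @ W])   # instruments (11.4), T x (K + M)
    XZ += X.T @ Z; ZOZ += Z.T @ Omega @ Z; Zy += Z.T @ y
    XOX += X.T @ Omega_inv @ X; XOy += X.T @ Omega_inv @ y

Wgt = np.linalg.inv(ZOZ)                  # 3SLS weighting matrix (up to scale)
beta_3sls = np.linalg.solve(XZ @ Wgt @ XZ.T, XZ @ Wgt @ Zy)
beta_re = np.linalg.solve(XOX, XOy)       # GLS with the same Omega
```

In this design the two coefficient vectors should agree up to floating-point error, and the instrument matrix has $K + M = 6$ columns, leaving $M = 2$ overidentifying restrictions.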

The algebraic equivalence between 3SLS and RE estimation has some useful
applications. First, it provides a different way of testing the orthogonality between $c_i$
and $\mathbf{x}_{it}$ for all $t$: after 3SLS estimation, we simply apply the GMM overidentification
statistic from Chapter 8. (We discussed regression-based tests in Section 10.7.3.)
Second, it provides a way to obtain a more efficient estimator when Assumption
RE.3 does not hold. If $\boldsymbol{\Omega}$ does not have the RE structure (see equation (10.30)), then
the 3SLS estimator that imposes this structure is inefficient; an unrestricted estimator
of $\boldsymbol{\Omega}$ should be used instead. Because an unrestricted estimator of $\boldsymbol{\Omega}$ is consistent with
or without the RE structure, 3SLS with unrestricted $\hat{\boldsymbol{\Omega}}$ and IVs in matrix (11.4) is no
less efficient than the RE estimator. Further, if $E(\mathbf{v}_i\mathbf{v}_i' \mid \mathbf{x}_i) \neq E(\mathbf{v}_i\mathbf{v}_i')$, any 3SLS
estimator is inefficient relative to GMM with the optimal weighting matrix. Therefore, if
Assumption RE.3 fails, minimum chi-square estimation with IVs in matrix (11.4) 
generally improves on the random effects estimator. In other words, we can gain 
asymptotic efficiency by using only M < K additional moment conditions. 

A different 3SLS estimator can be shown to be equivalent to the FE estimator. In 
particular, IASW (1999, Theorem 4.1) verify an assertion of Arellano and Bover 
(1995): when $\hat{\boldsymbol{\Omega}}$ has the random effects form, the GMM 3SLS estimator applied to
equation (11.2) using instruments $\mathbf{L}_T \otimes \mathbf{x}_i^o$, where $\mathbf{L}_T$ is the $T \times (T - 1)$
differencing matrix defined in IASW (1999, eq. (4.1)), is identical to the FE estimator.
Therefore, if we intend to use an RE structure for $\hat{\boldsymbol{\Omega}}$ in a GMM 3SLS analysis with
instruments $\mathbf{L}_T \otimes \mathbf{x}_i^o$, we might as well just use the usual FE estimator applied to
(11.2). But if we use an unrestricted form of $\hat{\boldsymbol{\Omega}}$ (presumably because we think
$E(\mathbf{u}_i\mathbf{u}_i') \neq \sigma_u^2\mathbf{I}_T$), the GMM 3SLS estimator that uses instruments $\mathbf{L}_T \otimes \mathbf{x}_i^o$ is
generally more efficient than FE if $\boldsymbol{\Omega}$ does not have the RE form. Further, if the system
homoskedasticity assumption $E(\mathbf{u}_i\mathbf{u}_i' \mid \mathbf{x}_i, c_i) = E(\mathbf{u}_i\mathbf{u}_i')$ fails, GMM with the general
optimal weighting matrix would usually increase efficiency over the GMM 3SLS 
estimator. 


11.1.2 Chamberlain’s Approach to Unobserved Effects Models 


We now study an approach to estimating the linear unobserved effects model (11.1) 
due to Chamberlain (1982, 1984) and related to Mundlak (1978). We maintain the 
strict exogeneity assumption on $\mathbf{x}_{it}$ conditional on $c_i$ (see Assumption FE.1), but we
allow arbitrary correlation between $c_i$ and $\mathbf{x}_{it}$. Thus we are in the FE environment,
and $\mathbf{x}_{it}$ contains only time-varying explanatory variables.

In Chapter 10 we saw that the FE and FD transformations eliminate $c_i$ and
produce consistent estimators under strict exogeneity. Chamberlain’s approach is to
replace the unobserved effect $c_i$ with its linear projection onto the explanatory variables
in all time periods (plus the projection error). Assuming $c_i$ and all elements of $\mathbf{x}_i$ have
finite second moments, we can always write 


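$$c_i = \psi + \mathbf{x}_{i1}\boldsymbol{\lambda}_1 + \mathbf{x}_{i2}\boldsymbol{\lambda}_2 + \cdots + \mathbf{x}_{iT}\boldsymbol{\lambda}_T + a_i, \tag{11.5}$$

where $\psi$ is a scalar and $\boldsymbol{\lambda}_1, \ldots, \boldsymbol{\lambda}_T$ are $K \times 1$ vectors (so that each product $\mathbf{x}_{it}\boldsymbol{\lambda}_t$ is a
scalar). The projection error $a_i$, by definition, has zero mean and is uncorrelated with
$\mathbf{x}_{i1}, \ldots, \mathbf{x}_{iT}$. This equation assumes
nothing about the conditional distribution of $c_i$ given $\mathbf{x}_i$. In particular, $E(c_i \mid \mathbf{x}_i)$ is
unrestricted, as in the usual FE analysis. Therefore, although equation (11.5) has the
flavor of a correlated random effects specification, it does not restrict the dependence
between $c_i$ and $\mathbf{x}_i$ in any way.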

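Plugging equation (11.5) into equation (11.1) gives, for each $t$,

$$y_{it} = \psi + \mathbf{x}_{i1}\boldsymbol{\lambda}_1 + \cdots + \mathbf{x}_{it}(\boldsymbol{\beta} + \boldsymbol{\lambda}_t) + \cdots + \mathbf{x}_{iT}\boldsymbol{\lambda}_T + r_{it}, \tag{11.6}$$

where, under Assumption FE.1, the errors $r_{it} \equiv a_i + u_{it}$ satisfy

$$E(r_{it}) = 0, \quad E(\mathbf{x}_i' r_{it}) = \mathbf{0}, \quad t = 1, 2, \ldots, T. \tag{11.7}$$

However, unless we assume that $E(c_i \mid \mathbf{x}_i)$ is linear, it is not the case that $E(r_{it} \mid \mathbf{x}_i) =
0$. Nevertheless, assumption (11.7) suggests a variety of methods for estimating $\boldsymbol{\beta}$
(along with $\psi, \boldsymbol{\lambda}_1, \ldots, \boldsymbol{\lambda}_T$).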

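Write the system (11.6) for all time periods $t$ as

$$
\begin{pmatrix} y_{i1} \\ y_{i2} \\ \vdots \\ y_{iT} \end{pmatrix}
=
\begin{pmatrix}
1 & \mathbf{x}_{i1} & \mathbf{x}_{i2} & \cdots & \mathbf{x}_{iT} & \mathbf{x}_{i1} \\
1 & \mathbf{x}_{i1} & \mathbf{x}_{i2} & \cdots & \mathbf{x}_{iT} & \mathbf{x}_{i2} \\
\vdots & \vdots & \vdots & & \vdots & \vdots \\
1 & \mathbf{x}_{i1} & \mathbf{x}_{i2} & \cdots & \mathbf{x}_{iT} & \mathbf{x}_{iT}
\end{pmatrix}
\begin{pmatrix} \psi \\ \boldsymbol{\lambda}_1 \\ \boldsymbol{\lambda}_2 \\ \vdots \\ \boldsymbol{\lambda}_T \\ \boldsymbol{\beta} \end{pmatrix}
+
\begin{pmatrix} r_{i1} \\ r_{i2} \\ \vdots \\ r_{iT} \end{pmatrix}
\tag{11.8}
$$

or

$$\mathbf{y}_i = \mathbf{W}_i\boldsymbol{\theta} + \mathbf{r}_i, \tag{11.9}$$

where $\mathbf{W}_i$ is $T \times (1 + TK + K)$ and $\boldsymbol{\theta}$ is $(1 + TK + K) \times 1$. From equation (11.7),
$E(\mathbf{W}_i'\mathbf{r}_i) = \mathbf{0}$, and so system OLS is one way to consistently estimate $\boldsymbol{\theta}$. The rank
condition requires that rank $E(\mathbf{W}_i'\mathbf{W}_i) = 1 + TK + K$; essentially, it suffices that the
elements of $\mathbf{x}_{it}$ are not collinear and that they vary sufficiently over time. While
system OLS is consistent, it is very unlikely to be the most efficient estimator. Not only
is the scalar variance assumption $E(\mathbf{r}_i\mathbf{r}_i') = \sigma_r^2\mathbf{I}_T$ highly unlikely, but also the
homoskedasticity assumption

$$E(\mathbf{r}_i\mathbf{r}_i' \mid \mathbf{x}_i) = E(\mathbf{r}_i\mathbf{r}_i') \tag{11.10}$$

fails unless we impose further assumptions. Generally, assumption (11.10) is violated if
$E(\mathbf{u}_i\mathbf{u}_i' \mid c_i, \mathbf{x}_i) \neq E(\mathbf{u}_i\mathbf{u}_i')$, if $E(c_i \mid \mathbf{x}_i)$ is not linear in $\mathbf{x}_i$, or if $\mathrm{Var}(c_i \mid \mathbf{x}_i)$ is not constant.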

If assumption (11.10) does happen to hold, feasible GLS is a natural approach. 
The matrix $\boldsymbol{\Omega} = E(\mathbf{r}_i\mathbf{r}_i')$ can be consistently estimated by first estimating $\boldsymbol{\theta}$ by system
OLS, and then proceeding with FGLS as in Section 7.5. 
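
The first step, system OLS on (11.9), can be sketched as follows. This is a minimal illustration with hypothetical function names; it builds $\mathbf{W}_i$ row by row as $(1, \mathbf{x}_{i1}, \ldots, \mathbf{x}_{iT}, \mathbf{x}_{it})$ and runs pooled least squares on the stacked system. The simulated data are deliberately noise-free ($a_i = 0$, $u_{it} = 0$), so the estimator should reproduce $\boldsymbol{\theta}$ exactly, which makes the mapping between $\boldsymbol{\theta}$ and $(\psi, \boldsymbol{\lambda}_1, \ldots, \boldsymbol{\lambda}_T, \boldsymbol{\beta})$ easy to verify.

```python
import numpy as np

def chamberlain_W(X_i: np.ndarray) -> np.ndarray:
    """Regressors of (11.8) for one unit: row t is (1, x_i1, ..., x_iT, x_it)."""
    T, K = X_i.shape
    x_all = np.tile(X_i.reshape(1, T * K), (T, 1))     # same TK entries in every row
    return np.column_stack([np.ones(T), x_all, X_i])   # T x (1 + TK + K)

def system_ols(y: np.ndarray, X: np.ndarray) -> np.ndarray:
    """System OLS for theta; y is N x T and X is N x T x K."""
    W = np.concatenate([chamberlain_W(X_i) for X_i in X])   # NT x (1 + TK + K)
    theta, *_ = np.linalg.lstsq(W, y.reshape(-1), rcond=None)
    return theta

# Noise-free check on simulated data (all numbers below are made up).
rng = np.random.default_rng(2)
N, T, K = 50, 3, 2
X = rng.normal(size=(N, T, K))
psi, lam, beta = 0.4, rng.normal(size=(T, K)), np.array([1.0, -0.5])
c = psi + np.einsum("ntk,tk->n", X, lam)               # projection (11.5) with a_i = 0
y = X @ beta + c[:, None]                              # u_it = 0
theta_hat = system_ols(y, X)
beta_hat = theta_hat[-K:]                              # last K entries are beta
```

With real data the residuals from this step would feed an estimate of $\boldsymbol{\Omega}$ for the FGLS step; that part is omitted here.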

If assumption (11.10) fails, a more efficient estimator is obtained by applying GMM 
to equation (11.9) with the optimal weighting matrix. Because $r_{it}$ is orthogonal to
$\mathbf{x}_i^o = (1, \mathbf{x}_{i1}, \ldots, \mathbf{x}_{iT})$, $\mathbf{x}_i^o$ can be used as instruments for each time period, and so we
choose the matrix of instruments (11.3). Interestingly, the 3SLS estimator, which uses
$[\mathbf{Z}'(\mathbf{I}_N \otimes \hat{\boldsymbol{\Omega}})\mathbf{Z}/N]^{-1}$ as the weighting matrix (see Section 8.3.4), is numerically
identical to FGLS with the same $\hat{\boldsymbol{\Omega}}$. Arellano and Bover (1995) showed this result in
the special case that $\hat{\boldsymbol{\Omega}}$ has the random effects structure, and IASW (1999, Theorem
3.1) obtained the general case.

In expression (11.9) there are 1 + TK + K parameters, and the matrix of instru- 
ments is T x T(1 + TK); there are T(1 + TK) — (1 + TK + K) = (T — 1)(1 + TK) 
— K overidentifying restrictions. Testing these restrictions is precisely a test of the 
strict exogeneity Assumption FE.1, and it is a fully robust test when full GMM is 
used because no additional assumptions are used. 
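
As a quick check on the counting, take the hypothetical case $T = 3$ and $K = 2$ (values chosen only for illustration):

$$
\begin{aligned}
\text{parameters} &= 1 + TK + K = 1 + 6 + 2 = 9, \\
\text{instrument columns} &= T(1 + TK) = 3 \cdot 7 = 21, \\
\text{overidentifying restrictions} &= 21 - 9 = 12 = (T - 1)(1 + TK) - K = 2 \cdot 7 - 2.
\end{aligned}
$$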

Chamberlain (1982) works from the system (11.8) under assumption (11.7), but he 
uses a different estimation approach, known as minimum distance estimation. We 
cover this approach to estimation in Chapter 14. 


11.2 Random and Fixed Effects Instrumental Variables Methods 


In this section we study estimation methods when some of the explanatory variables 
are not strictly exogenous. RE and FE estimations assume strict exogeneity of the 
instruments conditional on the unobserved effect, and RE estimation adds the as- 
sumption that the IVs are actually uncorrelated with the unobserved effect. 

We again start with the model (11.1). In Chapter 10 and in Section 11.1, we cov- 
ered methods that assume, at a minimum, 


$$E(\mathbf{x}_{is}' u_{it}) = \mathbf{0}, \quad s, t = 1, \ldots, T. \tag{11.11}$$

(The one exception was pooled OLS, but POLS assumes $\mathbf{x}_{it}$ is uncorrelated with $c_i$.)
If we are willing to assume (11.11), but no more, then, as we saw in Chapter 10 and
Section 11.1, we can consistently estimate the coefficients on the time-varying
elements of $\mathbf{x}_{it}$ by FE, FD, GLS versions of these, or GMM. The next goal is to relax
assumption (11.11). 

Without assumption (11.11), we generally require IVs in order to consistently
estimate the parameters. Let $\{\mathbf{z}_{it} : t = 1, \ldots, T\}$ be a sequence of $1 \times L$ IV candidates,
where $L \geq K$. As we discussed in Chapter 8, a simple estimator is the pooled 2SLS
estimator (P2SLS). With an unobserved effect explicitly in the error term, consistency
of P2SLS essentially relies on

$$E(\mathbf{z}_{it}' c_i) = \mathbf{0}, \quad t = 1, \ldots, T \tag{11.12}$$

and

$$E(\mathbf{z}_{it}' u_{it}) = \mathbf{0}, \quad t = 1, \ldots, T, \tag{11.13}$$

along with a suitable rank condition (which you are invited to supply). Pooled
2SLS estimation is simple, and it is straightforward to make inference robust
to arbitrary heteroskedasticity and serial correlation in the composite errors,
$\{v_{it} \equiv c_i + u_{it} : t = 1, \ldots, T\}$. But if we are willing to assume (11.12) along with
(11.13), we probably are willing to assume that the instruments are actually strictly
exogenous (after removing $c_i$), that is,

$$E(\mathbf{z}_{is}' u_{it}) = \mathbf{0}, \quad s, t = 1, \ldots, T. \tag{11.14}$$


Assumptions (11.12) and (11.14) suggest that, under certain assumptions, an RE 
approach can be more efficient than a pooled 2SLS analysis. The random effects 
instrumental variables (REIV) estimator is simply a generalized IV (GIV) estimator 
applied to 


$$\mathbf{y}_i = \mathbf{X}_i\boldsymbol{\beta} + \mathbf{v}_i, \tag{11.15}$$

where $\mathbf{X}_i$ is $T \times K$, as usual, and the matrix of instruments $\mathbf{Z}_i \equiv (\mathbf{z}_{i1}', \mathbf{z}_{i2}', \ldots, \mathbf{z}_{iT}')'$ is
$T \times L$. What makes the REIV estimator special in the class of GIV estimators is that
$\boldsymbol{\Omega} \equiv E(\mathbf{v}_i\mathbf{v}_i')$ is assumed to have the RE form, just as in equation (10.30).

Assuming, for the moment, that $\boldsymbol{\Omega}$ is known, the REIV estimator can be obtained
as the system 2SLS estimator (see Chapter 8) of

$$\boldsymbol{\Omega}^{-1/2}\mathbf{y}_i = \boldsymbol{\Omega}^{-1/2}\mathbf{X}_i\boldsymbol{\beta} + \boldsymbol{\Omega}^{-1/2}\mathbf{v}_i \tag{11.16}$$

using transformed instruments $\boldsymbol{\Omega}^{-1/2}\mathbf{Z}_i$. The form of the estimator is given in
equation (8.47).

As we discussed in Chapter 8, consistency of the GIV estimator when $\boldsymbol{\Omega}$ is not
diagonal (as expected in the current framework because of the presence of $c_i$, not to
mention possible serial correlation in $\{u_{it} : t = 1, \ldots, T\}$) hinges critically on the
strict exogeneity of the instruments: $E(\mathbf{z}_{is}' v_{it}) = \mathbf{0}$ for all $s, t = 1, \ldots, T$. Without this
assumption the REIV estimator is inconsistent.

Naturally, we have to estimate $\boldsymbol{\Omega}$, imposing the RE form, but this follows our
treatment for the usual RE estimator. The first-stage estimator is pooled 2SLS. Given
the P2SLS residuals, we can estimate the variance components $\sigma_c^2$ and $\sigma_u^2$ just as in
equations (10.35) and (10.37).
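
When $\boldsymbol{\Omega}$ has the RE form, the transformation by $\boldsymbol{\Omega}^{-1/2}$ in (11.16) reduces, up to the scale factor $1/\sigma_u$, to subtracting a fraction $\lambda$ of each time average, where $\lambda = 1 - \{1/[1 + T(\sigma_c^2/\sigma_u^2)]\}^{1/2}$ is the usual quasi-time-demeaning parameter. The sketch below checks this identity numerically; the helper names and the variance values are hypothetical, and the symmetric inverse square root is computed by eigendecomposition as one convenient choice.

```python
import numpy as np

def re_omega(T, sigma_c2, sigma_u2):
    """Omega with the RE form: sigma_u2 * I_T + sigma_c2 * j_T j_T'."""
    return sigma_u2 * np.eye(T) + sigma_c2 * np.ones((T, T))

def inv_sqrt(Omega):
    """Symmetric inverse square root of a positive definite matrix."""
    w, V = np.linalg.eigh(Omega)
    return V @ np.diag(w ** -0.5) @ V.T

def quasi_demeaning_lambda(T, sigma_c2, sigma_u2):
    """lambda = 1 - {1 / [1 + T * sigma_c2 / sigma_u2]}^(1/2)."""
    return 1.0 - np.sqrt(1.0 / (1.0 + T * sigma_c2 / sigma_u2))

T, sigma_c2, sigma_u2 = 5, 2.0, 0.5       # hypothetical values
lam = quasi_demeaning_lambda(T, sigma_c2, sigma_u2)
P_T = np.ones((T, T)) / T                 # averaging matrix j (j'j)^{-1} j'
lhs = inv_sqrt(re_omega(T, sigma_c2, sigma_u2))
rhs = (np.eye(T) - lam * P_T) / np.sqrt(sigma_u2)
```

Because a common scale factor applied to the dependent variable, regressors, and instruments does not change an IV estimator, only the $\mathbf{I}_T - \lambda\mathbf{P}_T$ part matters in practice.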

We summarize with a set of assumptions, which are somewhat stronger than nec- 
essary for consistency (because we can always get by with zero covariances) but have 
the advantage of simplifying inference under the full set of assumptions. 


ASSUMPTION REIV.1: (a) $E(u_{it} \mid \mathbf{z}_i, c_i) = 0$, $t = 1, \ldots, T$; (b) $E(c_i \mid \mathbf{z}_i) = E(c_i) = 0$,
where $\mathbf{z}_i \equiv (\mathbf{z}_{i1}, \ldots, \mathbf{z}_{iT})$.


As usual, the assumption that c; has a zero mean is without loss of generality pro- 
vided we include an intercept in the model, which we always assume here. 
The rank condition is 


ASSUMPTION REIV.2: (a) rank $E(\mathbf{Z}_i'\boldsymbol{\Omega}^{-1}\mathbf{Z}_i) = L$; (b) rank $E(\mathbf{Z}_i'\boldsymbol{\Omega}^{-1}\mathbf{X}_i) = K$.


Because Assumption REIV.1 implies Assumption GIV.1 and Assumption REIV.2 is
Assumption GIV.2, the REIV estimator is consistent under these two assumptions.
Without further assumptions, a valid asymptotic variance estimator of $\hat{\boldsymbol{\beta}}_{REIV}$ should
be fully robust to heteroskedasticity and any pattern of serial correlation in $\{v_{it}\}$. The
formula is long but standard; see Problem 11.18. 

Typically, the default is to add an assumption that simplifies inference: 


ASSUMPTION REIV.3: (a) $E(\mathbf{u}_i\mathbf{u}_i' \mid \mathbf{z}_i, c_i) = \sigma_u^2\mathbf{I}_T$; (b) $E(c_i^2 \mid \mathbf{z}_i) = \sigma_c^2$.


As in the case of standard RE, the particular form of Q is motivated by Assumption 
REIV.3 and the estimator is the asymptotically efficient IV estimator under the full 
set of assumptions. The simplified (nonrobust) form of the asymptotic variance esti- 
mator can be expressed as 

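$$
\widehat{\mathrm{Avar}}(\hat{\boldsymbol{\beta}}_{REIV}) = \left[ \left( \sum_{i=1}^{N} \mathbf{X}_i'\hat{\boldsymbol{\Omega}}^{-1}\mathbf{Z}_i \right) \left( \sum_{i=1}^{N} \mathbf{Z}_i'\hat{\boldsymbol{\Omega}}^{-1}\mathbf{Z}_i \right)^{-1} \left( \sum_{i=1}^{N} \mathbf{Z}_i'\hat{\boldsymbol{\Omega}}^{-1}\mathbf{X}_i \right) \right]^{-1}. \tag{11.17}
$$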


This matrix can be used to construct asymptotic standard errors and Wald tests, but
it does rely on Assumption REIV.3.

Given that the usual RE estimator can be obtained from a pooled OLS regression
on quasi-time-demeaned data, it comes as no surprise that a similar result holds for
the REIV estimator. Let $\check{y}_{it} = y_{it} - \hat{\lambda}\bar{y}_i$, $\check{\mathbf{x}}_{it} = \mathbf{x}_{it} - \hat{\lambda}\bar{\mathbf{x}}_i$, and $\check{\mathbf{z}}_{it} = \mathbf{z}_{it} - \hat{\lambda}\bar{\mathbf{z}}_i$ be the
transformed dependent variable, regressors, and IVs, respectively, where $\hat{\lambda}$ is given as in
equation (10.81) (but where the P2SLS residuals, rather than the POLS residuals, are
used to obtain $\hat{\sigma}_c^2$ and $\hat{\sigma}_u^2$). Then the REIV estimator can be obtained as the P2SLS
estimator on

$$\check{y}_{it} = \check{\mathbf{x}}_{it}\boldsymbol{\beta} + \mathit{error}_{it}, \quad t = 1, \ldots, T; \; i = 1, \ldots, N, \tag{11.18}$$

using IVs $\check{\mathbf{z}}_{it}$. This formulation makes fully robust inference especially easy using any
econometrics package that computes heteroskedasticity and serial correlation robust
standard errors and test statistics for pooled 2SLS estimation. Because of its
interpretation as a pooled 2SLS estimator on transformed data, the REIV estimator is
sometimes called the random effects 2SLS estimator. Baltagi (1981) calls it the error 
components 2SLS estimator. 
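
The computation in (11.18) can be sketched in a few lines of NumPy. This is a minimal illustration rather than a full implementation: the function names are hypothetical, $\lambda$ is passed in as given (here computed from the variance components used to simulate the data) instead of being estimated from P2SLS residuals, and no standard errors are computed.

```python
import numpy as np

def pooled_2sls(y, X, Z):
    """Pooled 2SLS on stacked data: y is (n,), X is (n, K), Z is (n, L), L >= K."""
    X_hat = Z @ np.linalg.lstsq(Z, X, rcond=None)[0]   # first-stage fitted values
    return np.linalg.lstsq(X_hat, y, rcond=None)[0]    # regress y on fitted values

def quasi_demean(A, lam):
    """A is N x T x cols; subtract lam times each unit's time average."""
    return A - lam * A.mean(axis=1, keepdims=True)

def reiv(y, X, Z, lam):
    """REIV computed as P2SLS on quasi-time-demeaned data (lam treated as given)."""
    N, T, K = X.shape
    y_c = quasi_demean(y[:, :, None], lam).reshape(N * T)
    X_c = quasi_demean(X, lam).reshape(N * T, K)
    Z_c = quasi_demean(Z, lam).reshape(N * T, Z.shape[2])
    return pooled_2sls(y_c, X_c, Z_c)

# Simulated panel; every number below is made up for illustration.
rng = np.random.default_rng(3)
N, T = 400, 4
z = rng.normal(size=(N, T, 2))                         # excluded instruments
e = rng.normal(size=(N, T))
x = z.sum(axis=2) + e                                  # endogenous regressor
c = rng.normal(size=(N, 1))                            # uncorrelated with z
u = 0.8 * e + rng.normal(size=(N, T))                  # correlated with x through e
y = 1.0 + 0.5 * x + c + u

ones = np.ones((N, T, 1))
X = np.concatenate([ones, x[:, :, None]], axis=2)      # intercept and x
Z = np.concatenate([ones, z], axis=2)                  # intercept acts as its own IV
lam = 1.0 - np.sqrt(1.0 / (1.0 + T * 1.0 / 1.64))      # sigma_c2 = 1, sigma_u2 = 1.64
beta_reiv = reiv(y, X, Z, lam)
beta_p2sls = reiv(y, X, Z, 0.0)                        # lam = 0 reduces to pooled 2SLS
```

Setting `lam` to zero gives pooled 2SLS on the original data, which makes the relationship between the two estimators easy to see.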

It is straightforward to adapt standard specification tests to the REIV framework. 
For example, we can test the null hypothesis that a set of explanatory variables is 
exogenous by modifying the control function approach from Chapter 6. Suppose we 
write an unobserved effects model as 


$$y_{it1} = \mathbf{z}_{it1}\boldsymbol{\delta}_1 + \mathbf{y}_{it2}\boldsymbol{\alpha}_1 + \mathbf{y}_{it3}\boldsymbol{\gamma}_1 + c_{i1} + u_{it1}, \tag{11.19}$$

where $v_{it1} \equiv c_{i1} + u_{it1}$ is the composite error, and we want to test $H_0: E(\mathbf{y}_{it3}' v_{it1}) = \mathbf{0}$.
Actually, in an RE environment, it only makes sense to maintain strict exogeneity of
the $1 \times J_1$ vector $\mathbf{y}_{it3}$ under the null. The variables $\mathbf{y}_{it2}$ are allowed to be endogenous
under the null, provided, of course, that we have sufficient instruments excluded
from (11.19) that are uncorrelated with the composite errors in every time period. We
can write a reduced form for $\mathbf{y}_{it3}$ as

$$\mathbf{y}_{it3} = \mathbf{z}_{it}\boldsymbol{\Pi}_3 + \mathbf{v}_{it3}, \tag{11.20}$$

and then we augment (11.19) by including $\mathbf{v}_{it3}$. Of course, we must decide how to
estimate (11.20) to obtain residuals, $\hat{\mathbf{v}}_{it3} = \mathbf{y}_{it3} - \mathbf{z}_{it}\hat{\boldsymbol{\Pi}}_3$, where the columns of $\hat{\boldsymbol{\Pi}}_3$ can
be pooled OLS or standard RE estimates. That is, we can estimate the $J_1$ equations in
(11.20) separately by pooled OLS or by RE. Either way, to obtain the test, we simply
estimate

$$y_{it1} = \mathbf{z}_{it1}\boldsymbol{\delta}_1 + \mathbf{y}_{it2}\boldsymbol{\alpha}_1 + \mathbf{y}_{it3}\boldsymbol{\gamma}_1 + \hat{\mathbf{v}}_{it3}\boldsymbol{\rho}_1 + \mathit{error}_{it1} \tag{11.21}$$

by REIV, using instruments $(\mathbf{z}_{it}, \mathbf{y}_{it3}, \hat{\mathbf{v}}_{it3})$, and test $H_0: \boldsymbol{\rho}_1 = \mathbf{0}$ using a Wald test.
Under the null hypothesis that $\mathbf{y}_{it3}$ is strictly exogenous (with respect to $\{v_{it1}\}$), we
can ignore the first-stage estimation. (Actually, if we use the usual, nonrobust test,
then the null is Assumptions RE.1–RE.3 but with the instruments for time period $t$
given by $(\mathbf{z}_{it}, \mathbf{y}_{it3})$.) We might, of course, want to make the test robust to violation of
the RE variance structure as well as to system heteroskedasticity. 

Testing overidentifying restrictions is also straightforward. Now write the equa- 
tion as 


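$$y_{it1} = \mathbf{z}_{it1}\boldsymbol{\delta}_1 + \mathbf{y}_{it2}\boldsymbol{\alpha}_1 + c_{i1} + u_{it1}, \tag{11.22}$$

where $\mathbf{z}_{it2}$ is the $1 \times L_2$ vector of exogenous variables excluded from (11.19). We
assume $L_2 > G_1 = \dim(\mathbf{y}_{it2})$, and write $\mathbf{z}_{it2} = (\mathbf{g}_{it2}, \mathbf{h}_{it2})$, where $\mathbf{g}_{it2}$ is $1 \times G_1$ (the same
dimension as $\mathbf{y}_{it2}$) and $\mathbf{h}_{it2}$ is $1 \times Q_1$, with $Q_1$ the number of overidentifying restrictions. As
discussed in Section 6.3.1, it does not matter how we choose $\mathbf{h}_{it2}$ provided it has $Q_1$
elements. (In the case where the model contains separate time period intercepts, these
would be included in $\mathbf{z}_{it1}$ and then act as their own instruments. Therefore, $\mathbf{h}_{it2}$ never
includes aggregate time effects.)

The test is obtained as follows. Obtain the quasi-time-demeaned RE residuals,
$\check{u}_{it1} = \hat{u}_{it1} - \hat{\lambda}\bar{\hat{u}}_{i1}$, where $\hat{u}_{it1} \equiv y_{it1} - \mathbf{x}_{it1}\hat{\boldsymbol{\beta}}_1$ are the RE residuals (with $\mathbf{x}_{it1} = (\mathbf{z}_{it1}, \mathbf{y}_{it2})$),
along with the quasi-time-demeaned explanatory variables and instruments,
$\check{\mathbf{x}}_{it1} = \mathbf{x}_{it1} - \hat{\lambda}\bar{\mathbf{x}}_{i1}$ and $\check{\mathbf{z}}_{it} = \mathbf{z}_{it} - \hat{\lambda}\bar{\mathbf{z}}_i$. Let $\hat{\check{\mathbf{y}}}_{it2}$ be the fitted values from the first stage
pooled regression $\check{\mathbf{y}}_{it2}$ on $\check{\mathbf{z}}_{it}$. Next, use pooled OLS of $\check{\mathbf{h}}_{it2}$ on $\check{\mathbf{z}}_{it1}$, $\hat{\check{\mathbf{y}}}_{it2}$ and obtain the
$1 \times Q_1$ residuals, $\hat{\check{\mathbf{r}}}_{it2}$. Finally, use the augmented equation

$$\check{u}_{it1} = \hat{\check{\mathbf{r}}}_{it2}\boldsymbol{\eta}_1 + \mathit{error}_{it1} \tag{11.23}$$

to test $H_0: \boldsymbol{\eta}_1 = \mathbf{0}$ by computing a Wald statistic that is robust to both
heteroskedasticity and serial correlation. If Assumption REIV.3 is maintained under $H_0$,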
the usual F test from the pooled OLS regression in equation (11.23) is asymptotically 
valid. 

As is clear from the preceding analysis, a random effects approach allows some
regressors to be correlated with $c_i$ and $u_{it}$ in (11.1), but the instruments are assumed
to be strictly exogenous with respect to the idiosyncratic errors and to be uncorrelated
with $c_i$. In many applications, instruments are arguably uncorrelated with
idiosyncratic shocks in all time periods but might be correlated with historical or immutable
factors contained in $c_i$. In such cases, a fixed effects approach is preferred.

As in the case with regressors satisfying (11.2), we use the within transformation to
eliminate $c_i$ from (11.1):

$$y_{it} - \bar{y}_i = (\mathbf{x}_{it} - \bar{\mathbf{x}}_i)\boldsymbol{\beta} + u_{it} - \bar{u}_i, \quad t = 1, \ldots, T, \tag{11.24}$$

which makes it clear that time-constant explanatory variables are eliminated, as in
standard FE analysis. If we estimate equation (11.24) by P2SLS using instruments
$\ddot{\mathbf{z}}_{it} \equiv \mathbf{z}_{it} - \bar{\mathbf{z}}_i$ ($1 \times L$ with $L \geq K$), we obtain the fixed effects instrumental variables
(FEIV) estimator or the fixed effects 2SLS (FE2SLS) estimator. Because for each
$i$ we have $\sum_{t=1}^{T} \mathbf{z}_{it}'(\mathbf{x}_{it} - \bar{\mathbf{x}}_i) = \sum_{t=1}^{T} (\mathbf{z}_{it} - \bar{\mathbf{z}}_i)'(\mathbf{x}_{it} - \bar{\mathbf{x}}_i)$ and $\sum_{t=1}^{T} \mathbf{z}_{it}'(y_{it} - \bar{y}_i) =
\sum_{t=1}^{T} (\mathbf{z}_{it} - \bar{\mathbf{z}}_i)'(y_{it} - \bar{y}_i)$, it does not matter whether or not we time demean the
instruments. But using $\ddot{\mathbf{z}}_{it}$ emphasizes that FEIV works only when the instruments
vary over time. (For an REIV analysis, instruments can be constant over time pro- 
vided they satisfy Assumptions REIV.1 and REIV.2.) 
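
The claim that demeaning the instruments is optional follows because the time-demeaned regressors sum to zero over $t$, so the $\bar{\mathbf{z}}_i$ term drops out of each cross product. A small numerical check for a single unit, with made-up dimensions and random data, looks like this:

```python
import numpy as np

rng = np.random.default_rng(4)
T, K, L = 5, 2, 3                         # hypothetical dimensions for one unit i
x = rng.normal(size=(T, K))
z = rng.normal(size=(T, L))
y = rng.normal(size=T)

x_dd = x - x.mean(axis=0)                 # time-demeaned regressors
y_dd = y - y.mean()                       # time-demeaned dependent variable
z_dd = z - z.mean(axis=0)                 # time-demeaned instruments

zx_raw, zx_dd = z.T @ x_dd, z_dd.T @ x_dd # sum_t z_it'(x_it - xbar_i), both ways
zy_raw, zy_dd = z.T @ y_dd, z_dd.T @ y_dd # sum_t z_it'(y_it - ybar_i), both ways
```

Both pairs of cross products should agree up to floating-point error, so the P2SLS estimator on (11.24) is the same with either set of instruments.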

Sufficient conditions for consistency of the FE2SLS estimator are immediate. Not 
surprisingly, a sufficient exogeneity condition for consistency of FEIV is simply the 
first part of the exogeneity assumption for REIV: 


ASSUMPTION FEIV.1: $E(u_{it} | z_i, c_i) = 0$, $t = 1, \ldots, T$.


In practice, it is important to remember that Assumption FEIV.1 is strict exogeneity
of the instruments conditional on $c_i$; it allows arbitrary correlation between $z_{it}$ and $c_i$
for all $t$. The rank condition is


ASSUMPTION FEIV.2: (a) rank $\sum_{t=1}^T E(\ddot{z}_{it}'\ddot{z}_{it}) = L$; (b) rank $\sum_{t=1}^T E(\ddot{z}_{it}'\ddot{x}_{it}) = K$.


As usual, we can relax FEIV.1 to just assuming $z_{is}$ and $u_{it}$ are uncorrelated for all $s$
and $t$. But if we impose Assumptions FEIV.1, FEIV.2, and


ASSUMPTION FEIV.3: $E(u_i u_i' | z_i, c_i) = \sigma_u^2 I_T$,


then we can apply standard pooled 2SLS inference on the time-demeaned data, provided
we take care in estimating $\sigma_u^2$. As in Section 10.5, we must use a degrees-of-freedom
adjustment for the sum of squared residuals (effectively for the lost time
period for each $i$). See equation (10.56), where we now use the FE2SLS residuals,
$\hat{\ddot{u}}_{it} = \ddot{y}_{it} - \ddot{x}_{it}\hat{\beta}_{FE2SLS}$, in place of the usual FE residuals. Problem 11.9 asks you to
work through some of the details.

As with REIV methods, testing a subset of variables for endogeneity and testing
overidentification restrictions are immediate. Rather than estimate equation (11.21)
by REIV, we use FEIV, where $\hat{\check{v}}_{it3}$ is conveniently replaced with $\hat{\ddot{v}}_{it3}$, the FE residuals
from estimating the reduced form of each element of $y_{it3}$ by fixed effects. (So, in the
scalar case, we estimate $y_{it3} = z_{it}\pi_3 + a_{i3} + v_{it3}$ by fixed effects and then compute
$\hat{\ddot{v}}_{it3} = \ddot{y}_{it3} - \ddot{z}_{it}\hat{\pi}_3$.) It is important to remember that, unlike the test in the REIV case,
the FEIV test does not maintain Assumption REIV.1b under the null. That is, for
FE, exogeneity of $y_{it3}$ does not include that it is uncorrelated with $c_{i1}$, something we
know well from Chapter 10.
To test overidentifying restrictions, we let $\hat{\ddot{u}}_{it1}$ be the FEIV residuals, let $\ddot{z}_{it}$ denote
the time-demeaned instruments, and let $\hat{\ddot{y}}_{it2}$ be the fitted values from the first-stage
pooled OLS regression $\ddot{y}_{it2}$ on $\ddot{z}_{it}$. Then, the $1 \times Q_1$ residuals $\hat{\ddot{r}}_{it2}$ are obtained from
the pooled OLS regression $\ddot{h}_{it2}$ on $\ddot{z}_{it1}$, $\hat{\ddot{y}}_{it2}$. To obtain a valid test statistic, we use $\hat{\ddot{u}}_{it1}$
and $\hat{\ddot{r}}_{it2}$ in place of $\hat{\check{u}}_{it1}$ and $\hat{\check{r}}_{it2}$, respectively, in equation (11.23). Again, a fully robust
test of $H_0: \eta_1 = 0$ is preferred. If one maintains Assumption FEIV.3 under the null,
some care is needed in estimating $Var(u_{it1}) = \sigma_{u1}^2$. Implicitly, the variance estimator
obtained by using the pooled regression $\hat{\ddot{u}}_{it1}$ on $\hat{\ddot{r}}_{it2}$, $t = 1, \ldots, T$, $i = 1, \ldots, N$, uses the
sum of squared residuals divided by $NT - Q_1$, whereas the correct factor is
$N(T - 1) - Q_1$. Therefore, the nonrobust Wald statistic from pooled OLS should be
multiplied by $[N(T - 1) - Q_1]/(NT - Q_1)$.
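
As a small worked illustration of the degrees-of-freedom correction just described, the sketch below computes the scaling factor and the corresponding variance estimate. The function names are hypothetical, and the value $Q_1 = 1$ is chosen purely for illustration; $N = 1{,}149$ and $T = 4$ match the panel dimensions used later in Example 11.1.

```python
def fe_overid_scale(N, T, Q1):
    """Factor [N(T-1) - Q1] / (NT - Q1) applied to the nonrobust Wald statistic."""
    return (N * (T - 1) - Q1) / (N * T - Q1)

def sigma2_u1_hat(ssr, N, T, Q1):
    """Variance estimate that divides the SSR by N(T-1) - Q1 rather than NT - Q1."""
    return ssr / (N * (T - 1) - Q1)

# Hypothetical Q1 = 1; N and T as in the airfare panel.
scale = fe_overid_scale(N=1149, T=4, Q1=1)
```

With $T = 4$ the factor is close to $3/4$, so ignoring the correction would overstate the nonrobust statistic by roughly a third.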

With FEIV, we are testing whether the time-demeaned extra instruments, $\ddot{h}_{it2}$, are
uncorrelated with the idiosyncratic errors, and not that they are also uncorrelated
with the unobserved effect. This is as it should be, as consistency of the FEIV estimator
does not hinge on $E(z_{it}'c_i) = 0$. Now, of course, all instruments, including
those in $h_{it2}$, must vary over time (and the elements of $h_{it2}$ must vary across $i$, too).

If $E(u_i u_i') \neq \sigma_u^2 I_T$ but we nominally think $E(u_i u_i' | z_i, c_i) = E(u_i u_i')$, we can obtain a
more efficient estimator by applying GIV to the set of equations


$\ddot{y}_i = \ddot{X}_i\beta + \ddot{u}_i,$ (11.25)


where we drop one time period to avoid singularity of the error variance matrix.
We apply the GIV estimator to (11.25) using instruments $\ddot{Z}_i$. The formula is given
by equation (8.47) but with the time-demeaned matrices (minus one time period).
Naturally, we would use the FE2SLS estimator in the first stage to estimate $\Omega$.
We may want to use a fully robust variance matrix in case system homoskedasticity 
fails. 

The Hausman principle can be applied for comparing the FEIV and REIV estimators,
at least for the coefficients on the variables that change across $i$ and $t$. As in
Section 10.7.3, a regression-based test is easiest to carry out (but one should always
compare the magnitudes of the FEIV and REIV estimates to see whether they are
different in practically important ways). If we maintain Assumption RE.1a, which
underlies consistency of both estimators, then we can view the Hausman test as a
test of $E(c_i | z_i) = E(c_i)$, and this is the usual interpretation of the test. As before, we
specify as an alternative a correlated random effects structure, using the subset $w_{it}$ of
$z_{it}$ that varies across $i$ and $t$, $c_i = \xi_0 + \bar{w}_i\xi + a_i$, and then augment the original equation
(absorbing $\xi_0$ into the intercept):


$y_{it} = x_{it}\beta + \bar{w}_i\xi + a_i + u_{it}.$ (11.26)
Now, estimate (11.26) by REIV using instruments $(z_{it}, \bar{w}_i)$ and test $\xi = 0$. If we make
the test robust, then we are not maintaining Assumption REIV.3 under $H_0$. We can
also estimate (11.26) by pooled 2SLS and obtain a fully robust test.

REIV and FEIV methods can be applied to simultaneous equations models for 
panel data, models with time-varying omitted variables, and panel data models with 
measurement error. The following example uses airline route concentration ratios to 
estimate a passenger demand equation. 


Example 11.1 (Demand for Air Travel): The data set AIRFARE.RAW contains 
information on passengers, airfare, and route concentration ratios for 1,149 routes 
within the United States for the years 1997 through 2000. We estimate a simple de- 
mand model 


$\log(passen_{it}) = \theta_{t1} + \alpha_1\log(fare_{it}) + \delta_1\log(dist_i) + \delta_2[\log(dist_i)]^2 + c_{i1} + u_{it1},$ (11.27)


where we allow for separate year intercepts. The variable $dist_i$ is the route distance, in
miles; naturally, it does not change over time.

We estimate this equation using four different methods: RE, FE, REIV, and FEIV,
where the variable $\log(fare_{it})$ is treated as endogenous in the latter two cases. The IV
for $\log(fare_{it})$ is $concen_{it}$, the fraction of route traffic accounted for by the largest
carrier. For the REIV estimation, the distance variables are treated as exogenous
controls. Of course, they drop out of the FE and FEIV estimation.

For the IV estimation, we can think of the reduced form as being 


$\log(fare_{it}) = \theta_{t2} + \pi_{21}concen_{it} + \pi_{22}\log(dist_i) + \pi_{23}[\log(dist_i)]^2 + c_{i2} + u_{it2},$ (11.28)


which is (implicitly) estimated by RE or FE, depending on whether (11.27) is esti- 
mated by REIV or FEIV. The results of the estimation are given in Table 11.1. 

The RE and FE estimated elasticities of passenger demand with respect to airfare
are very similar (about -1.1 and -1.2, respectively), and neither is statistically different
from -1. The closeness of these estimates is, perhaps, not too surprising, given
that $\hat{\lambda} = .915$ for the RE estimation. This indicates that most of the variation in the
composite error is estimated to be due to $c_{i1}$, but, as we remember, that calculation
assumes that $\{u_{it1}\}$ is serially uncorrelated.

The large increase in standard errors when we allow for arbitrary serial correlation
(and, less importantly, heteroskedasticity) suggests the idiosyncratic errors are not 
serially uncorrelated. The robust standard error for RE, .102, is almost five times as 
large as the nonrobust standard error, .022. The increase for FE is not quite as dra- 
matic but is still substantial. In either case, the 95 percent confidence intervals are 
Table 11.1
Passenger Demand Model, United States Domestic Routes, 1997-2000
Dependent Variable: log(passen)

                        (1)              (2)             (3)        (4)
Explanatory Variable    Random Effects   Fixed Effects   REIV       FEIV
log(fare)               -1.102           -1.155          -.508      -.302
                        (.022)           (.023)          (.230)     (.277)
                        [.102]           [.083]          [.498]     [.613]
log(dist)               -1.971           --              -1.505     --
                        (.647)                           (.693)
                        [.704]                           [.784]
[log(dist)]^2           .171             --              .118       --
                        (.049)                           (.055)
                        [.054]                           [.067]
$\hat{\lambda}$         .915             --              .911       --

N                       1,149            1,149           1,149      1,149


All estimation methods include year dummies for 1998, 1999, and 2000 (not reported). 

The usual, nonrobust standard errors are in parentheses below the estimated coefficients. Standard errors 
robust to arbitrary heteroskedasticity and serial correlation are in brackets. 

All standard errors for RE and FE were obtained using the xtreg command in Stata 9.0. The nonrobust 
standard errors for REIV and FEIV were obtained using the command xtivreg. The fully robust standard 
errors were obtained using P2SLS on the quasi-time-demeaned and time-demeaned data, respectively. 


much wider when we use the more reliable robust standard errors. Clearly, given the 
closeness of the elasticity estimates, there is no point formally comparing RE and FE 
via a Hausman test. 

Of course, even the FE estimator assumes that $\log(fare)$ is uncorrelated with
$\{u_{it1} : t = 1, \ldots, T\}$, and this assumption could easily fail. Columns (3) and (4) use
$concen$ as an IV for $\log(fare)$. (The reduced form for $\log(fare)$ depends positively and
in a statistically significant way on $concen$, whether we use RE or FE, and using fully
robust standard errors.) The REIV estimator assumes that the concentration ratio is
uncorrelated with the idiosyncratic errors as well as the route heterogeneity, $c_{i1}$. Even
so, the estimated elasticity, -.508, is about half the size of the RE estimate. Based on
the nonrobust standard error, this elasticity is statistically different from zero (as well
as -1) at the 5 percent significance level. But when we use the more realistic robust
standard error, the elasticity becomes statistically insignificant; its robust $t$ statistic is
about -1.02. Therefore, once we instrument for $\log(fare)$ and use robust inference,
we can say very little about the true elasticity.

The estimated elasticity for FEIV is even smaller in magnitude and even less precisely
estimated. The estimated elasticity, -.302, is not even statistically different
from zero when we use the overly optimistic nonrobust standard error. The fully
robust $t$ statistic is less than .5 in magnitude.
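
The $t$ statistics quoted in the last two paragraphs follow directly from the log(fare) row of Table 11.1. The short sketch below reproduces the arithmetic; the helper `t_stat` and the dictionary layout are just conveniences for this illustration.

```python
def t_stat(coef, se, null=0.0):
    """t statistic for H0: coefficient equals `null`."""
    return (coef - null) / se

# Table 11.1, log(fare) row: coefficient, nonrobust se, fully robust se
reiv = {"coef": -0.508, "se": 0.230, "se_robust": 0.498}
feiv = {"coef": -0.302, "se": 0.277, "se_robust": 0.613}

t_reiv_zero = t_stat(reiv["coef"], reiv["se"])               # nonrobust, H0: elasticity = 0
t_reiv_unit = t_stat(reiv["coef"], reiv["se"], null=-1.0)    # nonrobust, H0: elasticity = -1
t_reiv_robust = t_stat(reiv["coef"], reiv["se_robust"])      # fully robust, H0: elasticity = 0
t_feiv_zero = t_stat(feiv["coef"], feiv["se"])               # nonrobust, H0: elasticity = 0
t_feiv_robust = t_stat(feiv["coef"], feiv["se_robust"])      # fully robust, H0: elasticity = 0
```

Comparing each statistic with the usual two-sided 5 percent critical value of about 1.96 gives the conclusions stated in the text: the nonrobust REIV statistics exceed it in absolute value, while the robust REIV statistic and both FEIV statistics do not.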
We can use the Hausman test (which is a simple $t$ statistic in this example) to
compare the models a few different ways. We consider only two. First, because the
FEIV and FE estimates are practically different, we might want to know whether
there is statistically significant evidence that $\log(fare_{it})$ is endogenous (correlated
with the idiosyncratic errors). Of course, we must maintain strict exogeneity of
$concen_{it}$, but only in the sense that it is uncorrelated with $\{u_{it1}\}$. We obtain the FE
residuals, $\hat{\ddot{u}}_{it2}$, from the reduced form estimation. Then we add $\hat{\ddot{u}}_{it2}$ to equation (11.27),
estimate it by standard FE, and obtain the $t$ statistic on $\hat{\ddot{u}}_{it2}$. Its nonrobust $t$ statistic is
-3.68, but the fully robust statistic is only -1.63, which implies a marginal statistical
rejection. Given the large differences in elasticity estimates between FE and FEIV,
one would not feel comfortable relying on the FE estimates.

We can also check for statistical significance between REIV and FEIV. In this
case, the easiest way to implement the test is to estimate the original equation, augmented
by the time average $\overline{concen}_i$, by REIV and use the usual $t$ statistic on $\overline{concen}_i$
or the robust form. (A simpler procedure works for obtaining the fully robust statistic:
just estimate the augmented equation by pooled IV and use the fully robust $t$
statistic on $\overline{concen}_i$.) The nonrobust $t$ statistic on $\overline{concen}_i$ is -3.08 while the robust $t$
statistic is -2.00. Interestingly, neither estimate is statistically different from zero, but
they are statistically different from each other, at least marginally.


Panel data methods combined with IVs have many applications to models with 
simultaneity or omitted time-varying variables. For example, Foster and Rosenzweig
(1995) use the FEIV approach to estimate the effects of adoption of high-yielding seed
varieties on household-level profits in rural India. Ayres and Levitt (1998) apply FE2SLS
to estimate the effect of Lojack electronic theft prevention devices on city car-theft
rates. Papke (2005) applies FE2SLS to building-level panel data on test pass rates 
and per-student spending. In each of these cases, unit-specific heterogeneity is elimi- 
nated, and then IVs are used for the suspected endogenous explanatory variable. 


11.3 Hausman and Taylor-Type Models


The results of Section 11.2 apply to a class of unobserved effects models studied by 
Hausman and Taylor (1981) (HT). The key feature of these models is that the 
assumptions imply the availability of instrumental variables from within the model; 
one need not look outside the model for exogenous variables. 

The HT model can be written as 


$y_{it} = w_i\gamma + x_{it}\beta + c_i + u_{it}, \quad t = 1, 2, \ldots, T,$ (11.29)


where all elements of $x_{it}$ display some time variation, and it is convenient to include
unity in $w_i$ and assume that $E(c_i) = 0$. We assume strict exogeneity conditional on $c_i$:


$E(u_{it} | w_i, x_{i1}, \ldots, x_{iT}, c_i) = 0, \quad t = 1, \ldots, T.$ (11.30)


Estimation of $\beta$ can proceed by FE: the FE transformation eliminates $w_i\gamma$ and $c_i$. As
usual, this approach places no restrictions on the correlation between $c_i$ and $(w_i, x_{it})$.
What about estimation of $\gamma$? If in addition to assumption (11.30) we assume


$E(w_i'c_i) = 0,$ (11.31)


then a $\sqrt{N}$-consistent estimator is easy to obtain: average equation (11.29) across
$t$, premultiply by $w_i'$, take expectations, use the fact that $E[w_i'(c_i + \bar{u}_i)] = 0$, and rearrange
to get


$E(w_i'w_i)\gamma = E[w_i'(\bar{y}_i - \bar{x}_i\beta)].$


Now, making the standard assumption that $E(w_i'w_i)$ is nonsingular, it follows by the
usual analogy principle argument that


$\hat{\gamma} = \left(N^{-1}\sum_{i=1}^N w_i'w_i\right)^{-1}\left[N^{-1}\sum_{i=1}^N w_i'(\bar{y}_i - \bar{x}_i\hat{\beta}_{FE})\right]$


is consistent for $\gamma$. The asymptotic variance of $\sqrt{N}(\hat{\gamma} - \gamma)$ can be obtained by standard
arguments for two-step estimators. Rather than derive this asymptotic variance,
we turn to more general assumptions.
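
The two-step estimator of $\gamma$ just described can be illustrated with a small simulation. This is only a sketch under assumed values: the design (one time-varying regressor correlated with $c_i$, a constant plus one time-constant regressor uncorrelated with $c_i$) and all parameter values are made up for the example.

```python
import numpy as np

rng = np.random.default_rng(1)
N, T = 5000, 3
gamma = np.array([0.5, -1.0])   # coefficients on w_i = (1, w_i2); illustrative values
beta = 2.0

c = rng.normal(size=N)
w = np.column_stack([np.ones(N), rng.normal(size=N)])   # uncorrelated with c_i
x = rng.normal(size=(N, T)) + c[:, None]                 # correlated with c_i
u = rng.normal(size=(N, T))
y = (w @ gamma)[:, None] + beta * x + c[:, None] + u

# Step 1: FE (within) estimator of beta; w_i and c_i are swept out.
xdd = x - x.mean(axis=1, keepdims=True)
ydd = y - y.mean(axis=1, keepdims=True)
beta_fe = (xdd * ydd).sum() / (xdd ** 2).sum()

# Step 2: cross-section regression of ybar_i - xbar_i * beta_fe on w_i.
resid_bar = y.mean(axis=1) - x.mean(axis=1) * beta_fe
gamma_hat = np.linalg.solve(w.T @ w, w.T @ resid_bar)
```

The second step is just OLS of the adjusted time averages on $w_i$, which is the sample analogue of the moment condition displayed above.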

Hausman and Taylor (1981) partition $w_i$ and $x_{it}$ as $w_i = (w_{i1}, w_{i2})$, $x_{it} =
(x_{it1}, x_{it2})$, where $w_{i1}$ is $1 \times J_1$, $w_{i2}$ is $1 \times J_2$, $x_{it1}$ is $1 \times K_1$, $x_{it2}$ is $1 \times K_2$, and assume
that


$E(w_{i1}'c_i) = 0$ and $E(x_{it1}'c_i) = 0$, all $t$. (11.32)


We still maintain assumption (11.30), so that $w_i$ and $x_{is}$ are uncorrelated with $u_{it}$ for
all $t$ and $s$.

Assumptions (11.30) and (11.32) provide orthogonality conditions that can be used
in a method of moments procedure. Hausman and Taylor actually imposed enough
assumptions so that the variance matrix $\Omega$ of the composite error $v_i = c_i j_T + u_i$ has
the random effects structure and Assumption SIV.5 from Section 8.3.4 holds for the 
relevant matrix of instruments. Neither of these is necessary, but together they afford 
some simplifications. 

Write equation (11.29) for all T time periods as 


$y_i = W_i\gamma + X_i\beta + v_i.$ (11.33)
Since $x_{it}$ is strictly exogenous and $Q_T v_i = Q_T u_i$ [where $Q_T \equiv I_T - j_T(j_T'j_T)^{-1}j_T'$ is
again the $T \times T$ time-demeaning matrix], it follows that $E[(Q_T X_i)'v_i] = 0$. Thus, the
$T \times K$ matrix $Q_T X_i$ can be used as instruments in estimating equation (11.33). If
these were the only instruments available, then we would be back to FE estimation of
$\beta$ without being able to estimate $\gamma$.

Additional instruments come from assumption (11.32). In particular, $w_{i1}$ is orthogonal
to $v_{it}$ for all $t$, and so is $x_{i1}^o$, the $1 \times TK_1$ vector containing $x_{it1}$ for all $t = 1, \ldots, T$.
Thus, define a set of instruments for equation (11.33) by


$[Q_T X_i, \; j_T \otimes (w_{i1}, x_{i1}^o)],$ (11.34)


which is a $T \times (K + J_1 + TK_1)$ matrix. Simply put, the vector of IVs for time period $t$
is $(\ddot{x}_{it}, w_{i1}, x_{i1}^o)$. With this set of instruments, the order condition for identification of
$(\gamma, \beta)$ is that $K + J_1 + TK_1 \geq J + K$ (with $J = J_1 + J_2$), or $TK_1 \geq J_2$. In effect, we must have a sufficient
number of elements in $x_{i1}^o$ to act as instruments for $w_{i2}$. ($\ddot{x}_{it}$ are the IVs for $x_{it}$, and $w_{i1}$
act as their own IVs.) Whether we do depends on the number of time periods, as well
as on $K_1$.
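
To make the dimensions in (11.34) concrete, the sketch below builds the instrument matrix for a single cross-sectional unit. The function name `ht_instruments` and the convention that the first $K_1$ columns of $X_i$ hold $x_{it1}$ are assumptions made for this illustration only.

```python
import numpy as np

def ht_instruments(X_i, w_i1, K1):
    """Build [Q_T X_i, j_T kron (w_i1, x_i1^o)] for one unit.

    X_i  : (T, K) array whose first K1 columns are assumed to be x_it1
    w_i1 : (J1,) array of time-constant exogenous variables
    """
    T, K = X_i.shape
    j_T = np.ones((T, 1))
    Q_T = np.eye(T) - j_T @ j_T.T / T            # time-demeaning matrix
    x_i1_o = X_i[:, :K1].reshape(-1)             # (x_i11, ..., x_iT1), length T*K1
    row = np.concatenate([w_i1, x_i1_o])[None, :]
    return np.hstack([Q_T @ X_i, np.kron(j_T, row)])

# Illustrative dimensions: T = 3, K = 2, K1 = 1, J1 = 2, J2 = 2.
rng = np.random.default_rng(0)
T, K, K1, J1, J2 = 3, 2, 1, 2, 2
Z_i = ht_instruments(rng.normal(size=(T, K)), rng.normal(size=J1), K1)
order_condition_holds = T * K1 >= J2
```

The first $K$ columns are the time-demeaned regressors, and the remaining $J_1 + TK_1$ columns repeat the same row in every time period, which is what the Kronecker product with $j_T$ accomplishes.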

Actually, matrix (11.34) does not include all possible instruments under assump- 
tions (11.30) and (11.32), even when we focus only on zero covariances. However, 
under the full set of Hausman-Taylor assumptions mentioned earlier (including the
assumption that $\Omega$ has the random effects structure), it can be shown that all instruments
other than those in matrix (11.34) are redundant in the sense of Section 8.6;
see IASW (1999, Theorem 4.4) for details.

Hausman and Taylor (1981) suggested estimating $\gamma$ and $\beta$ by REIV. As described
in Section 11.2, REIV can be implemented as a P2SLS estimator on transformed
data. For the particular choice of instruments in equation (11.34), we first estimate
the equation by P2SLS using instruments $z_{it} \equiv (\ddot{x}_{it}, w_{i1}, x_{i1}^o)$ for time period $t$. From
the P2SLS residuals, the quasi-time-demeaning parameter, $\hat{\lambda}$, is obtained as in equation
(10.81), where $\hat{\sigma}_c^2$ and $\hat{\sigma}_u^2$ are obtained from the P2SLS residuals, say $\check{v}_{it} =
y_{it} - w_i\check{\gamma} - x_{it}\check{\beta}$, rather than the POLS residuals. The quasi-time-demeaning operation
is then carried out as in equation (11.18). Some software packages contain specific
commands for estimating HT models, but not all allow for fully robust inference
(that is, violation of Assumption REIV.3). The P2SLS estimation on the quasi-time-demeaned
data makes it easy to obtain fully robust inference for any statistical
package that computes heteroskedasticity and serial correlation robust standard
errors and test statistics for P2SLS.

If $\Omega$ is not of the random effects form, or if Assumption SIV.5 fails, many more
instruments than are in matrix (11.34) can help improve efficiency. Unfortunately,
the value of these additional IVs is unclear. For practical purposes, 3SLS with $\hat{\Omega}$ of the
RE form, 3SLS with $\hat{\Omega}$ unrestricted, or GMM with optimal weighting matrix (using
the instruments in matrix (11.34)) should be sufficient, with the latter being the most
efficient in the presence of conditional heteroskedasticity. The first-stage estimator can
be the system 2SLS estimator using matrix (11.34) as instruments. The GMM overidentification
test statistic can be used to test the $TK_1 - J_2$ overidentifying restrictions.

In cases where $K_1 \geq J_2$, we can reduce the instrument list even further and still
achieve identification: we use $\bar{x}_{i1}$, rather than $x_{i1}^o$, as the instruments for $w_{i2}$. Then, the
IVs at time $t$ are $(\ddot{x}_{it}, w_{i1}, \bar{x}_{i1})$. We can then use the GIV estimator with this new set of
IVs. Quasi-demeaning leads to an especially simple analysis. Although it generally
reduces asymptotic efficiency, replacing $x_{i1}^o$ with $\bar{x}_{i1}$ is a reasonable way to reduce the
instrument list because much of the partial correlation between $w_{i2}$ and $x_{i1}^o$ is likely to
be through the time average, $\bar{x}_{i1}$. Some econometrics packages implement this version
of the HT estimator.

HT provide an application of their model to estimating the return to education, 
where education levels do not vary over the two years in their sample. Initially, HT 
include as the elements of $x_{it1}$ all time-varying explanatory variables: experience, an
indicator for bad health, and a previous-year unemployment indicator. Race and
union status are assumed to be uncorrelated with $c_i$, and, because these do not
change over time, they comprise $w_{i1}$. The only element of $w_{i2}$ is years of schooling.
HT apply the GIV estimator and obtain a return to schooling that is almost twice as 
large as the pooled OLS estimate. When they allow some of the time-varying ex- 
planatory variables to be correlated with c;, the estimated return to schooling gets 
even larger. It is difficult to know what to conclude, as the identifying assumptions 
are not especially convincing. For example, assuming that experience and union sta- 
tus are uncorrelated with the unobserved effect and then using this information to 
identify the return to schooling seems tenuous. 

Breusch, Mizon, and Schmidt (1989) studied the Hausman-Taylor model under the
additional assumption that $E(x_{it2}'c_i)$ is constant across $t$. This adds more orthogonality
conditions that can be exploited in estimation. See IASW (1999) for a recent analysis.


11.4 First Differencing Instrumental Variables Methods 


We now turn to first differencing methods combined with IVs. The model is as in 
(11.1), but now we remove the unobserved effect by taking the first difference: 
$\Delta y_{it} = \Delta x_{it}\beta + \Delta u_{it}, \quad t = 2, \ldots, T,$ (11.35)


or, in matrix form,


$\Delta y_i = \Delta X_i\beta + \Delta u_i.$ (11.36)


If we have instruments $z_{it}$ satisfying Assumption FEIV.1, and if the IVs have the
same dimension $L$ for all $t$, then we can apply, say, P2SLS to (11.35) using instruments
$\Delta z_{it}$. Analogous to FD and FE estimation with strictly exogenous $x_{it}$, we could
adopt Assumption FEIV.1 as our first assumption for the first difference instrumental
variables (FDIV) estimator. But here we are interested in more general cases. Rather
than necessarily differencing underlying instruments, we now let $w_{it}$ denote a $1 \times L_t$
vector of instrumental variables that are contemporaneously exogenous in the FD
equation:


ASSUMPTION FDIV.1: For $t = 2, \ldots, T$, $E(w_{it}'\Delta u_{it}) = 0$.


The important point about Assumption FDIV.1 is that we can choose elements of
$w_{it}$ that are not required to be strictly exogenous (conditional on $c_i$) in the original
model (11.1). We allow the dimension of the instruments to change (usually, grow)
as $t$ increases. When the dimension of $w_{it}$ changes with $t$, we choose the instrument
matrix as the $(T - 1) \times L$ matrix


$W_i = diag(w_{i2}, w_{i3}, \ldots, w_{iT}),$


where $L = L_2 + L_3 + \cdots + L_T$; see also equation (8.15). Sometimes, when $w_{it}$ has
dimension $L$ for all $t$, we prefer to choose


$W_i = (w_{i2}', \ldots, w_{iT}')'.$


In any case, the rank condition for the system IV estimator on the FD equation is


ASSUMPTION FDIV.2: (a) rank $E(W_i'W_i) = L$; (b) rank $E(W_i'\Delta X_i) = K$.


Under FDIV.1 and FDIV.2, we can consistently estimate $\beta$ using the system 2SLS
estimator, as described in Section 8.3.2. As discussed there, the S2SLS estimator in
the panel data context can be computed rather easily, but its characterization
depends on the structure of $W_i$. The first step is to run $T - 1$ separate first-stage
regressions,


$\Delta x_{it}$ on $w_{it}$, $\quad i = 1, 2, \ldots, N,$ (11.37)


and obtain the fitted values, say $\widehat{\Delta x}_{it}$ (a $1 \times K$ vector for all $i$ and $t$). Then, estimate
(11.35) by pooled IV using instruments $\widehat{\Delta x}_{it}$ for $\Delta x_{it}$. Inference is standard because
one can compute a variance matrix estimator robust to arbitrary heteroskedasticity
and serial correlation in $\{e_{it} \equiv \Delta u_{it} : t = 2, \ldots, T\}$. It is left as an exercise to state the
FD version of FEIV.3 that ensures we can use the standard pooled 2SLS variance
matrix.
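
The two-step computation (period-by-period first stages, then pooled IV) can be sketched as follows. Everything about the simulated design is an assumption for illustration: a scalar endogenous regressor, a strictly exogenous variable $z_{it}$ whose difference (plus a constant) plays the role of $w_{it}$, and arbitrary parameter values.

```python
import numpy as np

rng = np.random.default_rng(2)
N, T, beta = 3000, 4, 1.0
c = rng.normal(size=(N, 1))
z = rng.normal(size=(N, T))                      # exogenous variable outside the equation
u = rng.normal(size=(N, T))
x = z + 0.7 * u + c + rng.normal(size=(N, T))    # x_it correlated with u_it and c_i
y = beta * x + c + u

dy, dx, dz = np.diff(y, axis=1), np.diff(x, axis=1), np.diff(z, axis=1)

# First stage: one regression of Delta x_it on w_it per period.
dx_hat = np.empty_like(dx)
for t in range(T - 1):
    w_t = np.column_stack([np.ones(N), dz[:, t]])    # w_it = (1, Delta z_it)
    pi_t = np.linalg.lstsq(w_t, dx[:, t], rcond=None)[0]
    dx_hat[:, t] = w_t @ pi_t

# Second stage: pooled IV on the FD equation, fitted values as instruments.
beta_fdiv = (dx_hat * dy).sum() / (dx_hat * dx).sum()

# For comparison: pooled OLS on the FD equation, which ignores endogeneity.
beta_fd_ols = (dx * dy).sum() / (dx ** 2).sum()
```

In this design the FD OLS estimate is biased upward because $\Delta x_{it}$ and $\Delta u_{it}$ are positively correlated, while the IV estimate should be close to the true value.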

In many applications of FDIV, the errors in the FD equation (11.35) are neces- 
sarily serially correlated, often because the errors in equation (11.1) start off being 
serially uncorrelated; we cover the leading case in the next section. In such cases, we 
might wish to apply GMM with an optimal weighting matrix (which could, in some 
cases, be GMM 3SLS). In a first stage we need to obtain $(T - 1) \times 1$ residuals, say
$\check{e}_i = \Delta y_i - \Delta X_i\check{\beta}$, where $\check{\beta}$ is probably the system 2SLS estimator described earlier.
An optimal weighting matrix is


$\left(N^{-1}\sum_{i=1}^N W_i'\check{e}_i\check{e}_i'W_i\right)^{-1}.$ (11.38)


If $E(W_i'e_ie_i'W_i) = E(W_i'\Omega W_i)$, where $\Omega = E(e_ie_i')$, then we can replace $\check{e}_i\check{e}_i'$ (for all $i$)
in equation (11.38) with the $(T - 1) \times (T - 1)$ estimator $\hat{\Omega} = N^{-1}\sum_{i=1}^N \check{e}_i\check{e}_i'$. All of
the GMM theory we discussed in Chapter 8 applies directly.
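
Both versions of the weighting matrix are simple to compute once the instrument matrices and first-stage residuals are stacked in arrays. The sketch below assumes a three-dimensional array `W` of shape (N, T-1, L) and a residual array `e` of shape (N, T-1); the function names are hypothetical.

```python
import numpy as np

def optimal_weight(W, e):
    """Inverse of N^{-1} sum_i W_i' e_i e_i' W_i, as in (11.38)."""
    g = np.einsum("ntl,nt->nl", W, e)          # row i holds (W_i' e_i)'
    return np.linalg.inv(g.T @ g / W.shape[0])

def restricted_weight(W, e):
    """Same matrix with e_i e_i' replaced by Omega_hat = N^{-1} sum_i e_i e_i'."""
    N = W.shape[0]
    omega_hat = e.T @ e / N
    S = np.einsum("ntl,ts,nsm->lm", W, omega_hat, W) / N
    return np.linalg.inv(S)

# Illustrative inputs only: random instruments and residuals.
rng = np.random.default_rng(0)
W = rng.normal(size=(50, 3, 4))
e = rng.normal(size=(50, 3))
A_unrestricted = optimal_weight(W, e)
A_restricted = restricted_weight(W, e)
```

The unrestricted version exploits the fact that $W_i'\check{e}_i\check{e}_i'W_i$ is the outer product of the $L \times 1$ vector $W_i'\check{e}_i$ with itself, so no explicit loop over $i$ is needed.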

For a model with a single endogenous explanatory variable along with some 
strictly exogenous regressors, a general FD equation is 


$\Delta y_{it1} = \eta_{t1} + \alpha_1\Delta y_{it2} + \Delta z_{it1}\delta_1 + \Delta u_{it1}$


with instruments $w_{it} = (\Delta z_{it1}, w_{it2})$, where $w_{it2}$ is a set of exogenous variables omitted
from the structural equation. These could be differences themselves, that is, $w_{it2} =
\Delta z_{it2}$, where we may think of the reduced form for $y_{it2}$ being $y_{it2} = z_{it}\pi_2 + c_{i2} + u_{it2}$
and then we difference to remove $c_{i2}$. In the next example, from Levitt (1996),
the instruments for the endogenous explanatory variable are not obtained by
differencing.


Example 11.2 (Effects of Prison Population on Crime Rates): In order to estimate 
the causal effects of prison population increases on crime rates at the state level, 
Levitt (1996) uses episodes of prison overcrowding litigation as instruments for the 
growth in the prison population. Underlying Levitt’s FD equation is an unobserved 
effects model, 


$\log(crime_{it}) = \theta_{t1} + \alpha_1\log(prison_{it}) + z_{it1}\delta_1 + c_{i1} + u_{it1},$ (11.39)


where $\theta_{t1}$ represents different time intercepts and both crime and prison are measured
per 100,000 people. (The prison population variable is measured on the last day of
the previous year.) The vector $z_{it1}$ contains the log of police per capita, log of per
capita income, proportions of the population in four age categories (with omitted
group 35 and older), the unemployment rate (as a proportion), proportion of the
population that is black, and proportion of the population living in metropolitan
areas. The differenced equation is


$\Delta\log(crime_{it}) = \eta_{t1} + \alpha_1\Delta\log(prison_{it}) + \Delta z_{it1}\delta_1 + \Delta u_{it1},$ (11.40)


and the instruments are $(\Delta z_{it1}, final1_{it}, final2_{it})$; the last two variables are binary
indicators for whether final decisions were reached on prison overcrowding litigation
in the previous year and previous two years, respectively. We estimate this
equation using the data in PRISON.RAW.

The first-stage POLS regression of $\Delta\log(prison_{it})$ on $\Delta z_{it1}$, $final1_{it}$,
$final2_{it}$, and a full set of year dummies yields fully robust $t$ statistics for $final1_{it}$ and
$final2_{it}$ of -4.71 and -3.30, respectively. The robust test of joint significance of the
two instruments gives F = 18.81 and a p-value of zero to four decimal places.
Therefore, assuming the litigation variables are uncorrelated with the idiosyncratic
changes $\Delta u_{it1}$, a P2SLS estimation on the FD equation is justified. The 2SLS estimate
of $\alpha_1$ is -1.032 (fully robust se = .213). This is a large elasticity. By comparison, the
POLS estimate on the FD equation gives a coefficient much smaller in magnitude,
-.181 (fully robust se = .049). Not surprisingly, the 2SLS estimate is less precise.
Levitt (1996) found similar results using a somewhat longer time period (with missing
years for some states) and more instruments.

Can we formally reject exogeneity of $\Delta\log(prison_{it})$ in the FD equation? When
we estimate the reduced form for $\Delta\log(prison_{it})$, add the reduced-form residual
to equation (11.40), and estimate the augmented model by POLS, the fully robust
$t$ statistic on the residual is 3.05. Therefore, we strongly reject exogeneity of
$\Delta\log(prison_{it})$ in (11.40).


Because it is difficult to find a truly exogenous instrument (even once we remove
unobserved heterogeneity by FD or FE), it is prudent to study the properties of
panel data IV methods when the instruments might be correlated with the idiosyncratic
errors, just as we did with the explanatory variables themselves in Section
10.7.1. Under weak assumptions on the time series dependence, very similar calculations
show that the FEIV estimator has inconsistency on the order of $T^{-1}$ if the
instruments $z_{it}$ are contemporaneously exogenous, that is, $E(z_{it}'u_{it}) = 0$. By contrast,
the inconsistency in the FDIV estimator does not shrink to zero as $T$ grows under
contemporaneous exogeneity if either $z_{i,t-1}$ or $z_{i,t+1}$ is correlated with $u_{it}$. Unfortunately,
because the inconsistencies in FEIV and FDIV depend on the distribution of
$\{z_{it}\}$, we cannot know for sure which estimator has less inconsistency. Further, as
we discussed in Section 10.7.1, there is an important caveat to the $O(T^{-1})$ bias calculation
in the FEIV case: it assumes that the idiosyncratic errors $\{u_{it}\}$ are weakly
dependent, that is, I(0) (integrated of order zero). By contrast, the differencing
transformation eliminates a unit root in $\{u_{it}\}$, and may be preferred when $\{u_{it}\}$ is
a persistent process. More research needs to be done on this practically important
issue.


11.5 Unobserved Effects Models with Measurement Error 


One pitfall in using either FD or FE to eliminate unobserved heterogeneity is that the 
reduced variation in the explanatory variables can cause severe biases in the presence 
of measurement error. Measurement error in panel data was studied by Solon (1985) 
and Griliches and Hausman (1986). It is widely believed in econometrics that the 
differencing and FE transformations exacerbate measurement error bias (even though 
they eliminate heterogeneity bias). However, it is important to know that this con- 
clusion rests on the classical errors-in-variables (CEV) model under strict exogeneity, 
as well as on other assumptions. 
To illustrate, consider a model with a single explanatory variable,


$y_{it} = \beta x_{it}^* + c_i + u_{it},$ (11.41)


under the strict exogeneity assumption


$E(u_{it} | x_i^*, x_i, c_i) = 0, \quad t = 1, 2, \ldots, T,$ (11.42)


where $x_{it}$ denotes the observed measure of the unobservable $x_{it}^*$. Condition (11.42)
embodies the standard redundancy condition (that $x_{it}$ does not matter once $x_{it}^*$ is
controlled for) in addition to strict exogeneity of the unmeasured and measured
regressors. Denote the measurement error as $r_{it} = x_{it} - x_{it}^*$. Assuming that $r_{it}$ is uncorrelated
with $x_{it}^*$ (the key CEV assumption) and that variances and covariances
are all constant across $t$, it is easily shown that, as $N \to \infty$, the plim of the POLS
estimator is


$\operatorname{plim}_{N\to\infty} \hat{\beta}_{POLS} = \beta + \frac{Cov(x_{it}, c_i + u_{it} - \beta r_{it})}{Var(x_{it})} = \beta + \frac{Cov(x_{it}, c_i) - \beta\sigma_r^2}{Var(x_{it})},$ (11.43)


where $\sigma_r^2 = Var(r_{it}) = Cov(x_{it}, r_{it})$; this is essentially the formula derived by Solon
(1985).
From equation (11.43), we see that there are two sources of asymptotic bias in the
POLS estimator: correlation between $x_{it}$ and the unobserved effect, $c_i$, and a measurement
error bias term, $-\beta\sigma_r^2$. If $x_{it}$ and $c_i$ are positively correlated and $\beta > 0$, the
two sources of bias tend to cancel each other out.

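Now assume that $r_{it}$ is uncorrelated with $x_{is}^*$ for all $t$ and $s$, and for simplicity suppose
that $T = 2$. If we first difference to remove $c_i$ before performing OLS we obtain


$\operatorname{plim}_{N\to\infty} \hat{\beta}_{FD} = \beta + \frac{Cov(\Delta x_{it}, \Delta u_{it} - \beta\Delta r_{it})}{Var(\Delta x_{it})} = \beta - \beta\frac{Cov(\Delta x_{it}, \Delta r_{it})}{Var(\Delta x_{it})}$
$= \beta - \frac{2\beta[\sigma_r^2 - Cov(r_{it}, r_{i,t-1})]}{Var(\Delta x_{it})}$
$= \beta\left(1 - \frac{\sigma_r^2(1 - \rho_r)}{\sigma_{x^*}^2(1 - \rho_{x^*}) + \sigma_r^2(1 - \rho_r)}\right),$ (11.44)


where $\rho_{x^*} = Corr(x_{it}^*, x_{i,t-1}^*)$ and $\rho_r = Corr(r_{it}, r_{i,t-1})$, where we have used the fact that
$Cov(r_{it}, r_{i,t-1}) = \sigma_r^2\rho_r$ and $Var(\Delta x_{it}) = 2[\sigma_{x^*}^2(1 - \rho_{x^*}) + \sigma_r^2(1 - \rho_r)]$; see also Solon
(1985) and Hsiao (2003, p. 305). Equation (11.44) shows that, in addition to the ratio
$\sigma_r^2/\sigma_{x^*}^2$ being important in determining the size of the measurement error bias, the
ratio $(1 - \rho_r)/(1 - \rho_{x^*})$ is also important. As the autocorrelation in $x_{it}^*$ increases relative
to that in $r_{it}$, the measurement error bias in $\hat{\beta}_{FD}$ increases. In fact, as $\rho_{x^*} \to 1$,
the measurement error bias approaches $-\beta$.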
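
The attenuation factor in (11.44) is easy to evaluate and to check by simulation. The sketch below is illustrative only: the function name `fd_attenuation`, the normal distributions, and the parameter values ($\sigma_{x^*}^2 = 1$, $\sigma_r^2 = .25$, $\rho_{x^*} = .8$, $\rho_r = 0$) are assumptions chosen for the example, with $T = 2$ as in the text.

```python
import numpy as np

def fd_attenuation(sig2_x, sig2_r, rho_x, rho_r):
    """Factor multiplying beta in (11.44): plim(beta_FD) = beta * factor."""
    num = sig2_r * (1 - rho_r)
    return 1 - num / (sig2_x * (1 - rho_x) + num)

rng = np.random.default_rng(3)
N, beta = 200_000, 1.0
sig2_x, sig2_r, rho_x, rho_r = 1.0, 0.25, 0.8, 0.0

cov = sig2_x * np.array([[1.0, rho_x], [rho_x, 1.0]])
xstar = rng.multivariate_normal([0.0, 0.0], cov, size=N)           # persistent true regressor
c = rng.normal(size=(N, 1)) + xstar.mean(axis=1, keepdims=True)    # c_i correlated with x*
r = np.sqrt(sig2_r) * rng.normal(size=(N, 2))                      # serially uncorrelated error
x = xstar + r
y = beta * xstar + c + rng.normal(size=(N, 2))

dx, dy = x[:, 1] - x[:, 0], y[:, 1] - y[:, 0]
beta_fd = (dx @ dy) / (dx @ dx)
factor = fd_attenuation(sig2_x, sig2_r, rho_x, rho_r)
```

With these values the factor is $1 - .25/(.20 + .25) = 4/9$, so more than half of the true effect is lost to attenuation even though the measurement error variance is only a quarter of the variance of $x_{it}^*$.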

Of course, we can never know whether the bias in equation (11.43) is larger than
that in equation (11.44), or vice versa. Also, both expressions are based on the CEV
assumptions, and then some. If there is little correlation between $\Delta x_{it}$ and $\Delta r_{it}$, the
measurement error bias from first differencing may be small, but the small correlation
is offset by the fact that differencing can considerably reduce the variation in the
explanatory variables.

The FE estimator also has an attenuation bias under similar assumptions. You are
asked to derive the probability limit of $\hat{\beta}_{FE}$ in Problem 11.3.

Consistent estimation in the presence of measurement error is possible under cer- 
tain assumptions. Consider the more general model 


$y_{it} = z_{it}\gamma + \delta w_{it}^* + c_i + u_{it}, \quad t = 1, 2, \ldots, T,$ (11.45)


where $w_{it}^*$ is measured with error. Write $r_{it} = w_{it} - w_{it}^*$, and assume strict exogeneity
along with redundancy of $w_{it}$:


$E(u_{it} | z_i, w_i^*, w_i, c_i) = 0, \quad t = 1, 2, \ldots, T.$ (11.46)




Replacing $w_{it}^*$ with $w_{it}$ and first differencing gives


$\Delta y_{it} = \Delta z_{it}\gamma + \delta\Delta w_{it} + \Delta u_{it} - \delta\Delta r_{it}.$ (11.47)


The standard CEV assumption in the current context can be stated as


$E(r_{it} | z_i, w_i^*, c_i) = 0, \quad t = 1, 2, \ldots, T,$ (11.48)


which implies that $r_{it}$ is uncorrelated with $z_{is}$, $w_{is}^*$ for all $t$ and $s$. (As always in the
context of linear models, assuming zero correlation is sufficient for consistency, but
not for usual standard errors and test statistics to be valid.) Under assumption (11.48)
(and other measurement error assumptions), $\Delta r_{it}$ is correlated with $\Delta w_{it}$. To apply an
IV method to equation (11.47), we need at least one instrument for $\Delta w_{it}$. As in the
omitted variables and simultaneity contexts, we may have additional variables outside
the model that can be used as instruments. Analogous to the cross section case
(as in Chapter 5), one possibility is to use another measure on $w_{it}^*$, say $h_{it}$. If the
measurement error in $h_{is}$ is orthogonal to the measurement error in $w_{it}$, all $t$ and $s$,
then $\Delta h_{it}$ is a natural instrument for $\Delta w_{it}$ in equation (11.47). Of course, we can use
many more instruments in equation (11.47), as any linear combination of $z_i$ and $h_i$ is
uncorrelated with the composite error under the given assumptions.

Alternatively, a vector of variables $\mathbf{h}_{it}$ may exist that are known to be redundant
in equation (11.45), strictly exogenous, and uncorrelated with $r_{is}$ for all $s$. If $\Delta\mathbf{h}_{it}$ is
correlated with $\Delta w_{it}$, then an IV procedure, such as P2SLS, is easy to apply. Applying
something like P2SLS to equation (11.47) may produce asymptotically valid
statistics, but only under serial independence and homoskedasticity assumptions
on $\Delta u_{it}$. Generally, however, it is a good idea to use standard errors and test
statistics robust to arbitrary serial correlation and heteroskedasticity, or to use a full 
GMM approach that efficiently accounts for these. An alternative is to use the 
FE2SLS method. Ziliak, Wilson, and Stone (1999) find that, for a model explaining 
cyclicality of real wages, the FD and FE estimates are different in important ways. 
The differences largely disappear when IV methods are used to account for mea- 
surement error in the local unemployment rate. 

So far, the solutions to measurement error in the context of panel data have
assumed nothing about the serial correlation in $r_{it}$. Suppose that, in addition to
assumption (11.46), we assume that the measurement error is serially uncorrelated:


$$E(r_{it}r_{is}) = 0, \qquad s \neq t. \tag{11.49}$$


Assumption (11.49) opens up a solution to the measurement error problem with 
panel data that is not available with a single cross section or independently pooled 
cross sections. Under assumption (11.48), $r_{it}$ is uncorrelated with $w_{is}^*$ for all $t$ and $s$.
Thus, if we assume that the measurement error $r_{it}$ is serially uncorrelated, then $r_{it}$ is
uncorrelated with $w_{is}$ for all $t \neq s$. Since, by the strict exogeneity assumption, $\Delta u_{it}$ is
uncorrelated with all leads and lags of $\mathbf{z}_{it}$ and $w_{it}$, we have instruments readily
available. For example, $w_{i,t-2}$ and $w_{i,t-3}$ are valid as instruments for $\Delta w_{it}$ in equation
(11.47); so is $w_{i,t+1}$. Again, P2SLS or some other IV procedure can be used once the
list of instruments is specified for each time period. However, it is important to re- 
member that this approach requires the r; to be serially uncorrelated, in addition to 
the other CEV assumptions. 
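A small simulation can make the contrast between FD estimation by OLS and FD estimation with a lagged instrument concrete. The sketch below is purely illustrative: the data-generating process (a stationary AR(1) component in $w_{it}^*$, unit variances, no $\mathbf{z}_{it}$, all variables mean zero so no intercept is needed) and the function name are assumptions, not part of the text. It uses $w_{i,t-2}$ as the instrument for $\Delta w_{it}$ in a single differenced period.

```python
import numpy as np


def simulate_fd_iv(n=200_000, delta=1.0, rho=0.5, seed=0):
    """FD-OLS versus FD-IV with w_{i,t-2} as instrument, under serially
    uncorrelated measurement error (hypothetical design)."""
    rng = np.random.default_rng(seed)
    T = 3
    c = rng.normal(size=n)                       # unobserved effect
    a = np.empty((n, T))                         # stationary AR(1) part of w*
    a[:, 0] = rng.normal(size=n)
    for t in range(1, T):
        a[:, t] = rho * a[:, t - 1] + np.sqrt(1 - rho**2) * rng.normal(size=n)
    w_star = c[:, None] + a                      # true regressor, correlated with c
    r = rng.normal(size=(n, T))                  # measurement error, as in (11.49)
    w = w_star + r                               # observed, mismeasured regressor
    u = rng.normal(size=(n, T))
    y = delta * w_star + c[:, None] + u

    dy = y[:, 2] - y[:, 1]                       # differenced equation, last period
    dw = w[:, 2] - w[:, 1]
    z = w[:, 0]                                  # instrument w_{i,t-2}
    fd_ols = (dw @ dy) / (dw @ dw)               # attenuated toward zero
    fd_iv = (z @ dy) / (z @ dw)                  # consistent under (11.48)-(11.49)
    return fd_ols, fd_iv


fd_ols, fd_iv = simulate_fd_iv()
print(f"FD-OLS: {fd_ols:.3f}   FD-IV: {fd_iv:.3f}   (true delta = 1)")
```

In this design the instrument is relevant only because $w_{it}^*$ is serially correlated; with no persistence in $w_{it}^*$ the lag would have no predictive power for $\Delta w_{it}$.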

The methods just covered for solving measurement error problems all assume strict 
exogeneity of all explanatory variables. Naturally, things get harder when measure- 
ment error is combined with models with only sequentially exogenous explanatory 
variables. Nevertheless, differencing away the unobserved effect and then selecting 
instruments—based on the maintained assumptions—generally works in models with 
a variety of problems. We now turn explicitly to models under sequential exogeneity 
assumptions. 


11.6 Estimation under Sequential Exogeneity 


11.6.1 General Framework 
We can also apply the FDIV methods in Section 11.4 to estimate the standard model, 
$$y_{it} = \mathbf{x}_{it}\boldsymbol{\beta} + c_i + u_{it}, \qquad t = 1, \ldots, T, \tag{11.50}$$


under a sequential exogeneity assumption, properly modified for the presence of the
unobserved effect, $c_i$. Chamberlain (1992b) calls these sequential moment restrictions,
which can be written in conditional expectations form as


$$E(u_{it} \mid \mathbf{x}_{it}, \mathbf{x}_{i,t-1}, \ldots, \mathbf{x}_{i1}, c_i) = 0, \qquad t = 1, 2, \ldots, T. \tag{11.51}$$


When assumption (11.51) holds, we say that $\{\mathbf{x}_{it}\}$ is sequentially exogenous
conditional on the unobserved effect.
Given equation (11.50), assumption (11.51) is equivalent to


$$E(y_{it} \mid \mathbf{x}_{it}, \mathbf{x}_{i,t-1}, \ldots, \mathbf{x}_{i1}, c_i) = E(y_{it} \mid \mathbf{x}_{it}, c_i) = \mathbf{x}_{it}\boldsymbol{\beta} + c_i, \tag{11.52}$$


which provides a simple interpretation of sequential exogeneity conditional on $c_i$:
once $\mathbf{x}_{it}$ and $c_i$ have been controlled for, no past values of $\mathbf{x}_{it}$ affect the expected value
of $y_{it}$. Conditioning on the explanatory variables only up through time $t$ is more
natural than the strict exogeneity assumption, which requires conditioning on future
values of $\mathbf{x}_{it}$ as well. As we proceed, it is important to remember that equation (11.52)
is what we should have in mind when interpreting the estimates of $\boldsymbol{\beta}$. Estimating
equations in first differences, such as (11.35), do not have natural interpretations
when the explanatory variables are only sequentially exogenous.

As we will explicitly show in the next subsection, models with lagged dependent 
variables are naturally analyzed under sequential exogeneity. Keane and Runkle 
(1992) argue that panel data models with heterogeneity for testing rational expec- 
tations hypotheses do not satisfy the strict exogeneity requirement. But they do sat- 
isfy sequential exogeneity; in fact, the conditioning set in assumption (11.51) can 
include all variables observed at time $t - 1$.

As we saw in Section 7.2, in panel data models without unobserved effects, strict 
exogeneity is sometimes too strong an assumption, even in static and finite dis- 
tributed lag models. For example, suppose 


$$y_{it} = \mathbf{z}_{it}\boldsymbol{\gamma} + \delta h_{it} + c_i + u_{it}, \tag{11.53}$$

where $\{\mathbf{z}_{it}\}$ is strictly exogenous and $\{h_{it}\}$ is sequentially exogenous:

$$E(u_{it} \mid \mathbf{z}_i, h_{it}, \ldots, h_{i1}, c_i) = 0. \tag{11.54}$$

Further, $h_{it}$ is influenced by past $y_{it}$, say

$$h_{it} = \mathbf{z}_{it}\boldsymbol{\xi} + \eta y_{i,t-1} + \psi c_i + r_{it}. \tag{11.55}$$


For example, let $y_{it}$ be per capita condom sales in city $i$ during year $t$, and let $h_{it}$ be
the HIV infection rate for city $i$ in year $t$. Model (11.53) can be used to test whether
condom usage is influenced by the spread of HIV. The unobserved effect $c_i$ contains
city-specific unobserved factors that can affect sexual conduct, as well as the
incidence of HIV. Equation (11.55) is one way of capturing the fact that the spread of
HIV depends on past condom usage. Generally, if $E(r_{i,t+1}u_{it}) = 0$, it is easy to
show that $E(h_{i,t+1}u_{it}) = \eta E(y_{it}u_{it}) = \eta E(u_{it}^2) > 0$ if $\eta > 0$ under equations (11.54) and
(11.55). Therefore, strict exogeneity fails unless $\eta = 0$.

Sometimes in panel data applications one sees variables that are thought to be 
contemporaneously endogenous appear with a lag, rather than contemporaneously. 
So, for example, we might use $h_{i,t-1}$ in place of $h_{it}$ in equation (11.53) because we
think $h_{it}$ and $u_{it}$ are correlated. As an example, suppose $y_{it}$ is percentage of flights
cancelled by airline $i$ in year $t$, and $h_{it}$ is profits in the same year. We might specify
$y_{it} = \mathbf{z}_{it}\boldsymbol{\gamma} + \delta h_{i,t-1} + c_i + u_{it}$ for strictly exogenous $\mathbf{z}_{it}$. Of course, at $t + 1$, the
regressors are $\mathbf{x}_{i,t+1} = (\mathbf{z}_{i,t+1}, h_{it})$, which is correlated with $u_{it}$ if $h_{it}$ is. As we discussed in
Section 10.7, the FE estimator arguably has inconsistency of order $1/T$ in this
situation. But if we are willing to assume sequential exogeneity, we need not settle for an
inconsistent estimator at all.

A general approach to estimation under sequential exogeneity follows the approach in
Section 11.4. We take first differences to remove $c_i$ and obtain equation (11.35), written in
stacked form as equation (11.36). The only issue is where the instruments come from.
Under assumption (11.51),


$$E(\mathbf{x}_{is}'u_{it}) = \mathbf{0}, \qquad s = 1, \ldots, t;\ t = 1, \ldots, T, \tag{11.56}$$


which implies the orthogonality conditions


$$E(\mathbf{x}_{is}'\Delta u_{it}) = \mathbf{0}, \qquad s = 1, \ldots, t-1;\ t = 2, \ldots, T. \tag{11.57}$$


Therefore, at time $t$, the available instruments in the FD equation are in the vector
$\mathbf{x}_{i,t-1}^o$, where


$$\mathbf{x}_{it}^o \equiv (\mathbf{x}_{i1}, \mathbf{x}_{i2}, \ldots, \mathbf{x}_{it}). \tag{11.58}$$

Therefore, the matrix of instruments is simply

$$\mathbf{W}_i = \mathrm{diag}(\mathbf{x}_{i1}^o, \mathbf{x}_{i2}^o, \ldots, \mathbf{x}_{i,T-1}^o), \tag{11.59}$$


which has $T - 1$ rows. Because of sequential exogeneity, the number of valid
instruments increases with $t$.
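The block-diagonal structure in (11.59) is easy to get wrong by one period, so a small construction may help. The following sketch builds $\mathbf{W}_i$ for one cross section unit from a $T \times K$ array of regressors; the function name and the array layout are assumptions made for illustration.

```python
import numpy as np


def fd_instrument_matrix(x):
    """Build W_i = diag(x^o_{i1}, ..., x^o_{i,T-1}) from a (T, K) array whose
    rows are x_{i1}, ..., x_{iT}.  Row t-1 of the result (0-based) holds the
    instruments for the FD equation at period t+1 (1-based)."""
    T, K = x.shape
    n_cols = K * T * (T - 1) // 2                # K * (1 + 2 + ... + (T-1))
    W = np.zeros((T - 1, n_cols))
    col = 0
    for t in range(1, T):
        block = x[:t].reshape(-1)                # x^o_{it} = (x_{i1}, ..., x_{it})
        W[t - 1, col:col + block.size] = block
        col += block.size
    return W


x_i = np.arange(1.0, 9.0).reshape(4, 2)          # T = 4, K = 2
W_i = fd_instrument_matrix(x_i)
print(W_i.shape)                                 # (T - 1) rows, K*T*(T-1)/2 columns
```

The column count grows quadratically in $T$, which is the practical concern about many instruments raised later in this subsection.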

Given $\mathbf{W}_i$, it is routine to apply GMM estimation. But, as we discussed in Section
11.4, some simpler strategies are available. One useful one is to estimate a reduced
form for $\Delta\mathbf{x}_{it}$ separately for each $t$. So, at time $t$, run the regression $\Delta\mathbf{x}_{it}$ on $\mathbf{x}_{i,t-1}^o$,
$i = 1, \ldots, N$, and obtain the fitted values, $\widehat{\Delta\mathbf{x}}_{it}$. Of course, the fitted values are all
$1 \times K$ vectors for each $t$. Then, estimate the FD equation (11.35) by pooled IV using
instruments $\widehat{\Delta\mathbf{x}}_{it}$. It is simple to obtain robust standard errors and test statistics from
such a procedure because the first stage estimation to obtain the instruments can be
ignored (asymptotically, of course). This is the same set of estimates obtained if we
choose $\mathbf{W}_i$ as in (11.59) and choose as the weighting matrix $(\mathbf{W}'\mathbf{W}/N)^{-1}$; that is, we
obtain the system 2SLS estimator.

Given an initial consistent estimator, we can obtain the efficient GMM weighting 
matrix. In most applications, there is a reasonable set of assumptions under which 
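$E(\mathbf{W}_i'\mathbf{e}_i\mathbf{e}_i'\mathbf{W}_i) = E(\mathbf{W}_i'\boldsymbol{\Omega}\mathbf{W}_i)$, where $\mathbf{e}_i = \Delta\mathbf{u}_i$ and $\boldsymbol{\Omega} = E(\mathbf{e}_i\mathbf{e}_i')$, in which case the GMM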
3SLS estimator is an efficient GMM estimator. See Wooldridge (1996) and Arellano 
(2003) for examples. 

One potential problem with estimating the FD equation by IVs that are simply lags
of $\mathbf{x}_{it}$ is that changes in variables over time are often difficult to predict. In other
words, $\Delta\mathbf{x}_{it}$ might have little correlation with $\mathbf{x}_{i,t-1}^o$, in which case we face a problem
of weak instruments. In one case, we even lose identification: if $\mathbf{x}_{it} = \boldsymbol{\lambda}_t + \mathbf{x}_{i,t-1} + \mathbf{e}_{it}$
where $E(\mathbf{e}_{it} \mid \mathbf{x}_{i,t-1}, \ldots, \mathbf{x}_{i1}) = \mathbf{0}$ (that is, the elements of $\mathbf{x}_{it}$ are random walks with
drift), then $E(\Delta\mathbf{x}_{it} \mid \mathbf{x}_{i,t-1}, \ldots, \mathbf{x}_{i1}) = \mathbf{0}$, and the rank condition for IV estimation
fails. In the next subsection we will study this problem further in the context of the
AR(1) model.

As a practical matter, the column dimension of $\mathbf{W}_i$ can be large, especially with
large T. Using many overidentifying restrictions is known to contribute to poor finite 
sample properties of GMM, especially if many of the instruments are weak; see, for 
example, Tauchen (1986), Altonji and Segal (1996), Ziliak (1997), Stock, Wright, 
and Yogo (2002), and Han and Phillips (2006). It might be better (even though it is 
asymptotically less efficient, or no more efficient) to use just a couple of lags, say 
$\mathbf{w}_{it} = (\mathbf{x}_{i,t-1}, \mathbf{x}_{i,t-2})$, as the instruments at time $t \geq 3$, with $\mathbf{w}_{i2} = \mathbf{x}_{i1}$.

If the model contains strictly exogenous variables, as in equation (11.53), at a
minimum the instruments at time $t$ would include $\Delta\mathbf{z}_{it}$; that is, $\Delta\mathbf{z}_{it}$ acts as its own
instrument. We can still only include lags of the sequentially exogenous variables. So,
at time $t$ in the first difference of (11.53), $\Delta y_{it} = \Delta\mathbf{z}_{it}\boldsymbol{\gamma} + \delta\Delta h_{it} + \Delta u_{it}$, $t = 2, \ldots, T$, the
available IVs would be $\mathbf{w}_{it} = (\Delta\mathbf{z}_{it}, h_{i,t-1}, h_{i,t-2}, \ldots, h_{i1})$. Again, either a full GMM
procedure or a simple pooled IV procedure (after separate estimation of each reduced
form) can be applied, and perhaps only a couple of lags of $h_{it}$ would be included.


11.6.2 Models with Lagged Dependent Variables 


A special case of models under sequential exogeneity restrictions are autoregressive 
models. Here we study the AR(1) model and follow Arellano and Bond (1991). In the 
model without any other covariates, 


$$y_{it} = \rho y_{i,t-1} + c_i + u_{it}, \qquad t = 1, \ldots, T, \tag{11.60}$$
$$E(u_{it} \mid y_{i,t-1}, y_{i,t-2}, \ldots, y_{i0}, c_i) = 0, \tag{11.61}$$


so that our first observation on $y$ is at $t = 0$. Assumption (11.61) says that we have
the dynamics completely specified: once we control for $c_i$, only one lag of $y_{it}$ is
necessary. This is an example of a dynamic completeness conditional on the unobserved
effect assumption. When we let $x_{it} = y_{i,t-1}$, we see immediately that the AR(1) model
satisfies the sequential exogeneity assumptions (conditional on $c_i$).

One application of model (11.60) is to determine whether $\{y_{it}\}$ exhibits state
dependence: after controlling for systematic, time-constant differences $c_i$, does last
period's outcome on $y$ help predict this period's outcome? In the AR(1) model, the
answer is yes if $\rho \neq 0$.
At time $t$ in the FD equation $\Delta y_{it} = \rho\Delta y_{i,t-1} + \Delta u_{it}$, $t = 2, \ldots, T$, the available
instruments are $\mathbf{w}_{it} = (y_{i0}, \ldots, y_{i,t-2})$. Anderson and Hsiao (1982) proposed pooled
IV estimation of the FD equation with instrument $y_{i,t-2}$ (in which case all $T - 1$
periods can be used) or $\Delta y_{i,t-2}$ (in which case only $T - 2$ periods can be used).
Arellano and Bond (1991) suggested full GMM estimation using all of the available
instruments. The pooled IV estimator that uses, say, $y_{i0}$ as the IV at $t = 2$ and then
$(y_{i,t-2}, y_{i,t-3})$ for $t = 3, \ldots, T$, is easy to implement. It is likely more efficient than the
Anderson and Hsiao approach but less efficient than the full GMM approach. In this
method, $T - 1$ separate reduced forms are estimated for $\Delta y_{i,t-1}$.
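To see why OLS on the differenced AR(1) equation fails while a lagged level works as an instrument, one can simulate the model. The sketch below is an illustration under assumed conditions that are not in the text: $c_i$ and $u_{it}$ are standard normal, $y_{i0}$ is drawn from the stationary distribution, and no time effects are present, so regressions through the origin suffice. It implements only the simplest Anderson and Hsiao variant with $y_{i,t-2}$ as the single instrument, not the full GMM estimator.

```python
import numpy as np


def simulate_ar1_fd(n=50_000, T=4, rho=0.5, seed=0):
    """Pooled OLS versus Anderson-Hsiao IV on the first-differenced AR(1) model."""
    rng = np.random.default_rng(seed)
    c = rng.normal(size=n)
    y = np.empty((n, T + 1))                     # columns are t = 0, 1, ..., T
    y[:, 0] = c / (1 - rho) + rng.normal(size=n) / np.sqrt(1 - rho**2)
    for t in range(1, T + 1):
        y[:, t] = rho * y[:, t - 1] + c + rng.normal(size=n)

    dy = np.diff(y, axis=1)                      # dy[:, t-1] = y_t - y_{t-1}
    dep = dy[:, 1:].ravel()                      # Delta y_{it},     t = 2, ..., T
    lag = dy[:, :-1].ravel()                     # Delta y_{i,t-1},  t = 2, ..., T
    inst = y[:, :T - 1].ravel()                  # y_{i,t-2},        t = 2, ..., T

    pols = (lag @ dep) / (lag @ lag)             # inconsistent
    iv = (inst @ dep) / (inst @ lag)             # consistent under (11.61)
    return pols, iv


pols, iv = simulate_ar1_fd()
print(f"POLS on FD: {pols:.3f}   Anderson-Hsiao IV: {iv:.3f}   (true rho = 0.5)")
```

In this design the pooled OLS estimate is negative even though the true $\rho$ is positive, the same qualitative pattern as in column (1) of Table 11.2.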

As noted by Arellano and Bond (1991), the differenced errors $\Delta u_{it}$ and $\Delta u_{i,t-1}$ are
necessarily correlated under assumption (11.61). Therefore, at a minimum, any esti- 
mation method should account for this serial correlation in calculating standard 
errors and test statistics. GMM handles this serial correlation by using an efficient 
weighting matrix. Differenced errors two or more periods apart are uncorrelated, and 
this actually simplifies the optimal weighting matrix. Further, if we add conditional 
homoskedasticity, $\mathrm{Var}(u_{it} \mid y_{i,t-1}, y_{i,t-2}, \ldots, y_{i0}, c_i) = \sigma_u^2$, then the 3SLS version of the
weighting matrix can be used. See Arellano and Bond (1991) and also Wooldridge 
(1996) for verification. As is usually the case, even after we settle on the instruments, 
there are a variety of ways to use those instruments. 

A more general version of the model is 


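$$y_{it} = \theta_t + \rho y_{i,t-1} + \mathbf{z}_{it}\boldsymbol{\gamma} + c_i + u_{it}, \qquad t = 1, \ldots, T, \tag{11.62}$$


where $\theta_t$ denotes different period intercepts and $\{\mathbf{z}_{it}\}$ is strictly exogenous. When we
difference equation (11.62),


$$\Delta y_{it} = \eta_t + \rho\Delta y_{i,t-1} + \Delta\mathbf{z}_{it}\boldsymbol{\gamma} + \Delta u_{it}, \qquad t = 2, \ldots, T, \tag{11.63}$$


the available instruments (in addition to time period dummies) are $(\mathbf{z}_i, y_{i,t-2}, \ldots, y_{i0})$.
We might not want to use all of $\mathbf{z}_i$ for every time period. Certainly we would use $\Delta\mathbf{z}_{it}$,
and perhaps a lag, $\Delta\mathbf{z}_{i,t-1}$. If we add sequentially exogenous variables, say $\mathbf{h}_{it}$, to
(11.62) then $(\mathbf{h}_{i,t-1}, \ldots, \mathbf{h}_{i1})$ would be added to the list (and $\Delta\mathbf{h}_{it}$ would appear in the
equation). As always, we might not use the full set of lags in each time period.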

Even though our estimating equation is in first differences, it is important to 
remember that (11.63) is an estimating equation. We should interpret the estimates 
in light of the original model (11.62). It is equation (11.62) that represents a
dynamically complete conditional expectation. Equation (11.63) does not even satisfy
$E(\Delta u_{it} \mid \Delta y_{i,t-1}, \Delta\mathbf{z}_{it}) = 0$, which is why POLS estimation of it is inconsistent.
Applying IV to the FD equation is simply a way to estimate the parameters of the original
model.
Example 11.3 (Estimating a Dynamic Airfare Equation): Consider an AR(1) model
for the log of airfare using the data in AIRFARE.RAW:


$$lfare_{it} = \theta_t + \rho\, lfare_{i,t-1} + \gamma\, concen_{it} + c_i + u_{it},$$


where we include a full set of year dummies. We assume the concentration ratio is
strictly exogenous and that at most one lag of $lfare$ is needed to capture the dynamics.
Because we have data for 1997 through 2000, (11.62) is specified for three years.
After differencing, we have only two years of data.

The FD equation is


$$\Delta lfare_{it} = \eta_t + \rho\,\Delta lfare_{i,t-1} + \gamma\,\Delta concen_{it} + \Delta u_{it}, \qquad t = 1999, 2000.$$


If we estimate this equation by POLS, the estimators are inconsistent because
$\Delta lfare_{i,t-1}$ is correlated with $\Delta u_{it}$; we include the OLS estimates for comparison. We
also apply the simple pooled IV procedure, where separate reduced forms are
estimated for $\Delta lfare_{i,t-1}$: one for 1999, with $lfare_{i,t-2}$ and $\Delta concen_{it}$ in the reduced form,
and one for 2000, with $lfare_{i,t-2}$, $lfare_{i,t-3}$, and $\Delta concen_{it}$ in the reduced form. The
fitted values are used in the pooled IV estimation, with robust standard errors. (We
only use $\Delta concen_{it}$ in the IV list at time $t$.) Finally, we apply the Arellano and Bond
(1991) GMM procedure. The results are given in Table 11.2.

As is seen from column (1), the POLS estimate of $\rho$ is actually negative and
statistically different from zero. By contrast, the two IV methods give positive and
statistically significant estimates. The GMM estimate of $\rho$ is larger, and it also has a
smaller standard error (as we would hope for GMM). Compared with the POLS
estimate, the IV estimates of the $concen$ coefficient are larger and, especially for GMM,
much more statistically significant.


Table 11.2
Dynamic Airfare Model, First Differencing IV Estimation

Dependent Variable: lfare

                          (1)            (2)            (3)
Explanatory Variable      Pooled OLS     Pooled IV      Arellano-Bond

lfare_{-1}                -.126          .219           .333
                          (.027)         (.062)         (.055)
concen                    .076           .126           .152
                          (.053)         (.056)         (.040)
N                         1,149          1,149          1,149

The pooled IV estimates on the FD equation are obtained by first estimating separate reduced forms for
$\Delta lfare_{-1}$ for 1999 and 2000, where the IVs for 1999 are $lfare_{-2}$ and $\Delta concen$ and those for 2000 are $lfare_{-2}$,
$lfare_{-3}$, and $\Delta concen$.

The POLS and pooled IV standard errors are robust to heteroskedasticity and serial correlation. The
GMM standard errors are obtained from an optimal weighting matrix.

Separate year intercepts were estimated for all procedures (not reported).

The Arellano and Bond estimates were obtained using the xtabond command in Stata 9.0.


In Example 11.3, the estimated $\rho$ is quite far from one, and the IV estimates appear
to be well behaved. In some applications, $\rho$ is close to unity, and this leads to a weak
instrument problem, as discussed briefly in the previous subsection. The problem is
that the lagged change, $\Delta y_{i,t-1}$, is not very highly correlated with $(y_{i,t-2}, \ldots, y_{i0})$. In
fact, when $\rho = 1$, so that $\{y_{it}\}$ has a unit root, there is no correlation, and both the
pooled IV and Arellano and Bond procedures break down.
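How quickly the instrument weakens can be quantified in a special case. The sketch below assumes, beyond what the text states, that $|\rho| < 1$, that $y_{it}$ is drawn from its stationary distribution, and that $c_i$ and $u_{it}$ are uncorrelated with variances $\sigma_c^2$ and $\sigma_u^2$. Under those assumptions $\mathrm{Cov}(y_{i,t-2}, \Delta y_{i,t-1}) = -\sigma_u^2/(1+\rho)$, $\mathrm{Var}(\Delta y_{i,t-1}) = 2\sigma_u^2/(1+\rho)$, and $\mathrm{Var}(y_{i,t-2}) = \sigma_c^2/(1-\rho)^2 + \sigma_u^2/(1-\rho^2)$; the function name is hypothetical.

```python
import math


def first_stage_corr(rho, var_u=1.0, var_c=1.0):
    """Corr(y_{i,t-2}, Delta y_{i,t-1}) in a stationary AR(1) panel with an
    unobserved effect uncorrelated with the idiosyncratic errors."""
    cov = -var_u / (1.0 + rho)
    var_dy = 2.0 * var_u / (1.0 + rho)
    var_y = var_c / (1.0 - rho) ** 2 + var_u / (1.0 - rho**2)
    return cov / math.sqrt(var_dy * var_y)


for rho in (0.2, 0.5, 0.8, 0.95, 0.99):
    # The correlation shrinks toward zero as rho approaches one.
    print(f"rho = {rho:4.2f}   corr = {first_stage_corr(rho):+.4f}")
```

The covariance itself stays bounded away from zero as $\rho \to 1$; the correlation vanishes because the variance of the lagged level explodes.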

Arellano and Bover (1995) and Ahn and Schmidt (1995) suggest adding additional
moment conditions that improve the efficiency of the GMM estimator. For example,
in the basic model (11.60), Ahn and Schmidt show that assumption (11.61) (and
actually weaker versions based on zero correlation) implies the additional set of
nonredundant orthogonality conditions


$$E(v_{iT}\Delta u_{it}) = 0, \qquad t = 2, \ldots, T-1,$$


where $v_{iT} = c_i + u_{iT}$ is the composite error in the last time period. Then, we can
specify the equations


$$\Delta y_{it} = \rho\Delta y_{i,t-1} + \Delta u_{it}, \qquad t = 2, \ldots, T,$$


$$y_{iT} = \rho y_{i,T-1} + v_{iT}$$


and combine the original Arellano and Bond orthogonality conditions for the first set
of equations with those of Ahn and Schmidt for the latter equation. The resulting
moment conditions are nonlinear in $\rho$, which takes us outside the realm of linear
GMM. We cover nonlinear GMM methods in Chapter 14. Blundell and Bond (1998)
obtained additional linear moment restrictions in the levels equation $y_{it} = \rho y_{i,t-1} + v_{it}$
based on $y_{i0}$ being drawn from a steady-state distribution. The extra moment
conditions are especially helpful for improving the precision (and even reducing the
bias) in the GMM estimator when $\rho$ is close to one. See also Hahn (1999). Of course,
when $\rho = 1$ it makes no sense to assume the existence of a steady-state distribution
(although that condition can be weakened somewhat). See Arellano (2003) and
Baltagi (2001, Chapter 8) for further discussion of the AR(1) model, with and without
strictly exogenous and sequentially exogenous explanatory variables.


11.7 Models with Individual-Specific Slopes 


The unobserved effects models we have studied up to this point all have an additive 
unobserved effect that has the same partial effect on $y_{it}$ in all time periods. This
assumption may be too strong for some applications. We now turn to models that
allow for individual-specific slopes. 


11.7.1 Random Trend Model 
Consider the following extension of the standard unobserved effects model: 
$$y_{it} = c_i + g_i t + \mathbf{x}_{it}\boldsymbol{\beta} + u_{it}, \qquad t = 1, 2, \ldots, T. \tag{11.64}$$


This is sometimes called a random trend model, as each individual, firm, city, and so
on is allowed to have its own time trend. The individual-specific trend is an additional
source of heterogeneity. If $y_{it}$ is the natural log of a variable, as is often the case in
economic studies, then $g_i$ is (roughly) the average growth rate over a period (holding
the explanatory variables fixed). Then equation (11.64) is referred to as a random growth
model; see, for example, Heckman and Hotz (1989).

In many applications of equation (11.64) we want to allow $(c_i, g_i)$ to be arbitrarily
correlated with $\mathbf{x}_{it}$. (Unfortunately, allowing this correlation makes the name
"random trend model" conflict with our previous usage of random versus fixed effects.)
For example, if one element of $\mathbf{x}_{it}$ is an indicator of program participation, equation
(11.64) allows program participation to depend on individual-specific trends (or
growth rates) in addition to the level effect, $c_i$. We proceed without imposing
restrictions on correlations among $(c_i, g_i, \mathbf{x}_{it})$, so that our analysis is of the fixed effects
variety. A random effects approach is also possible, but it is more cumbersome; see 
Problem 11.5. 

For the random trend model, the strict exogeneity assumption on the explanatory 
variables is 


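$$E(u_{it} \mid \mathbf{x}_{i1}, \ldots, \mathbf{x}_{iT}, c_i, g_i) = 0, \tag{11.65}$$

which follows definitionally from the conditional mean specification

$$E(y_{it} \mid \mathbf{x}_{i1}, \ldots, \mathbf{x}_{iT}, c_i, g_i) = E(y_{it} \mid \mathbf{x}_{it}, c_i, g_i) = c_i + g_i t + \mathbf{x}_{it}\boldsymbol{\beta}. \tag{11.66}$$


We are still primarily interested in consistently estimating $\boldsymbol{\beta}$.
One approach to estimating $\boldsymbol{\beta}$ is to difference away $c_i$:


$$\Delta y_{it} = g_i + \Delta\mathbf{x}_{it}\boldsymbol{\beta} + \Delta u_{it}, \qquad t = 2, 3, \ldots, T, \tag{11.67}$$


where we have used the fact that $g_i t - g_i(t-1) = g_i$. Now equation (11.67) is just
the standard unobserved effects model we studied in Chapter 10. The key strict
exogeneity assumption, $E(\Delta u_{it} \mid g_i, \Delta\mathbf{x}_{i2}, \ldots, \Delta\mathbf{x}_{iT}) = 0$, $t = 2, 3, \ldots, T$, holds under
assumption (11.65). Therefore, we can apply FE or FD methods to equation (11.67) in
order to estimate $\boldsymbol{\beta}$.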
In differencing the equation to eliminate $c_i$ we lose one time period, so that
equation (11.67) applies to $T - 1$ time periods. To apply FE or FD methods to equation
(11.67) we must have $T - 1 \geq 2$, or $T \geq 3$. In other words, $\boldsymbol{\beta}$ can be estimated
consistently in the random trend model only if $T \geq 3$.

Whether we prefer FE or FD estimation of equation (11.67) depends on the
properties of $\{\Delta u_{it}: t = 2, 3, \ldots, T\}$. As we argued in Section 10.6, in some cases it is
reasonable to assume that the FD of $\{u_{it}\}$ is serially uncorrelated, in which case the
FE method applied to equation (11.67) is attractive. If we make the assumption that
the $u_{it}$ are serially uncorrelated and homoskedastic (conditional on $\mathbf{x}_i$, $c_i$, $g_i$), then FE
applied to equation (11.67) is still consistent and asymptotically normal, but not
efficient. The next subsection covers that case explicitly.
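The two-step logic (difference to remove $c_i$, then apply the within transformation to remove $g_i$) can be checked on simulated data. The following sketch uses an assumed design in which the scalar regressor trends with $g_i$, so that ignoring the individual-specific trend is costly; variable names and parameter values are illustrative, not taken from the text.

```python
import numpy as np


def random_trend_fd_fe(n=5_000, T=5, beta=1.0, seed=0):
    """Compare pooled OLS on first differences (ignores g_i) with FE applied
    to the first-differenced equation (11.67)."""
    rng = np.random.default_rng(seed)
    t = np.arange(1, T + 1)
    c = rng.normal(size=n)                            # level heterogeneity
    g = rng.normal(size=n)                            # individual-specific trend
    x = g[:, None] * t + rng.normal(size=(n, T))      # x_it trends with g_i
    y = c[:, None] + g[:, None] * t + beta * x + rng.normal(size=(n, T))

    dx, dy = np.diff(x, axis=1), np.diff(y, axis=1)   # removes c_i, leaves g_i
    b_fd = (dx.ravel() @ dy.ravel()) / (dx.ravel() @ dx.ravel())

    dx_w = dx - dx.mean(axis=1, keepdims=True)        # within transformation
    dy_w = dy - dy.mean(axis=1, keepdims=True)        # removes g_i
    b_fdfe = (dx_w.ravel() @ dy_w.ravel()) / (dx_w.ravel() @ dx_w.ravel())
    return b_fd, b_fdfe


b_fd, b_fdfe = random_trend_fd_fe()
print(f"POLS on FD: {b_fd:.3f}   FE on FD: {b_fdfe:.3f}   (true beta = 1)")
```

Pooled OLS on the differenced data is biased here because $\Delta x_{it}$ is correlated with $g_i$; demeaning the differences within each unit restores consistency.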


Example 11.4 (Random Growth Model for Analyzing Enterprise Zones): Papke 
(1994) estimates a random growth model to examine the effects of enterprise zones on 
unemployment claims: 


$$\log(uclms_{it}) = \theta_t + c_i + g_i t + \delta_1 ez_{it} + u_{it},$$


so that aggregate time effects are allowed in addition to a jurisdiction-specific growth
rate, $g_i$. She first differences the equation to eliminate $c_i$ and then applies FE
estimation to the differences. The data are in EZUNEM.RAW. The estimate of $\delta_1$ is
$\hat{\delta}_1 = -.192$ with $\mathrm{se}(\hat{\delta}_1) = .085$. Thus, enterprise zone designation is predicted to lower
unemployment claims by about 19.2 percent, and the effect is statistically significant 
at the 5 percent level. 


Friedberg (1998) provides an example, using state-level panel data on divorce rates 
and divorce laws, that shows how important it can be to allow for state-specific 
trends. Without state-specific trends, she finds no effect of unilateral divorce laws on 
divorce rates; with state-specific trends, the estimated effect is large and statistically 
significant. The estimation method Friedberg uses is the one we discuss in the next 
subsection. 

In using the random trend or random growth model for program evaluation, it
may make sense to allow the trend or growth rate to depend on program
participation: in addition to shifting the level of $y$, program participation may also affect the
rate of change. In addition to $prog_{it}$, we would include $prog_{it} \cdot t$ in the model:


$$y_{it} = \theta_t + c_i + g_i t + \mathbf{z}_{it}\boldsymbol{\gamma} + \delta_1 prog_{it} + \delta_2 prog_{it} \cdot t + u_{it}.$$

Differencing once, as before, removes $c_i$,


$$\Delta y_{it} = \xi_t + g_i + \Delta\mathbf{z}_{it}\boldsymbol{\gamma} + \delta_1\Delta prog_{it} + \delta_2\Delta(prog_{it} \cdot t) + \Delta u_{it}.$$

We can estimate this differenced equation by FE. An even more flexible
specification is to replace $prog_{it}$ and $prog_{it} \cdot t$ with a series of program indicators,
$prog1_{it}, \ldots, progM_{it}$, where $progj_{it}$ is one if unit $i$ in time $t$ has been in the program
exactly $j$ years, and $M$ is the maximum number of years the program has been
around.

If $\{u_{it}\}$ contains substantial serial correlation (more than a random walk), then
differencing equation (11.67) might be more attractive. Denote the second difference
of $y_{it}$ by


$$\Delta^2 y_{it} \equiv \Delta y_{it} - \Delta y_{i,t-1} = y_{it} - 2y_{i,t-1} + y_{i,t-2},$$


with similar expressions for $\Delta^2\mathbf{x}_{it}$ and $\Delta^2 u_{it}$. Then

$$\Delta^2 y_{it} = \Delta^2\mathbf{x}_{it}\boldsymbol{\beta} + \Delta^2 u_{it}, \qquad t = 3, \ldots, T. \tag{11.68}$$


As with the FE transformation applied to equation (11.67), second differencing also
eliminates $g_i$. Because $\Delta^2 u_{it}$ is uncorrelated with $\Delta^2\mathbf{x}_{is}$, for all $t$ and $s$, we can estimate
equation (11.68) by POLS or a GLS procedure.

When T = 3, second differencing is the same as first differencing and then apply- 
ing FE. Second differencing results in a single cross section on the second-differenced 
data, so that if the second-difference error is homoskedastic conditional on $\mathbf{x}_i$, the
standard OLS analysis on the cross section of second differences is appropriate. 
Hoxby (1996) uses this method to estimate the effect of teachers’ unions on education 
production using three years of census data. 
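The equivalence for $T = 3$ is purely algebraic: with two differenced periods, the within-demeaned first differences are $\pm\Delta^2 y_{i3}/2$, so the two slope estimates coincide exactly. The sketch below verifies this numerically on arbitrary simulated data; the scalar-regressor setup and the function name are assumptions for illustration.

```python
import numpy as np


def second_diff_vs_fd_fe(n=500, seed=0):
    """For T = 3, compare OLS on second differences with FE on first differences."""
    rng = np.random.default_rng(seed)
    x = rng.normal(size=(n, 3))
    y = 0.7 * x + rng.normal(size=(n, 3))

    # OLS (through the origin) on the single cross section of second differences
    d2x = x[:, 2] - 2 * x[:, 1] + x[:, 0]
    d2y = y[:, 2] - 2 * y[:, 1] + y[:, 0]
    b_sd = (d2x @ d2y) / (d2x @ d2x)

    # First difference, then within-demean the two differenced periods
    dx, dy = np.diff(x, axis=1), np.diff(y, axis=1)
    dx_w = dx - dx.mean(axis=1, keepdims=True)
    dy_w = dy - dy.mean(axis=1, keepdims=True)
    b_fdfe = (dx_w.ravel() @ dy_w.ravel()) / (dx_w.ravel() @ dx_w.ravel())
    return b_sd, b_fdfe


b_sd, b_fdfe = second_diff_vs_fd_fe()
print(b_sd, b_fdfe)   # identical up to floating-point rounding
```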

If $\mathbf{x}_{it}$ contains a time trend, then $\Delta\mathbf{x}_{it}$ contains the same constant for
$t = 2, 3, \ldots, T$, which then gets swept away in the FE or FD transformation applied to
equation (11.67). Therefore, $\mathbf{x}_{it}$ cannot have time-constant variables or variables that
have exact linear time trends for all cross section units.


11.7.2 General Models with Individual-Specific Slopes 


We now consider a more general model with interactions between time-varying ex- 
planatory variables and some unobservable, time-constant variables: 


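$$y_{it} = \mathbf{w}_{it}\mathbf{a}_i + \mathbf{x}_{it}\boldsymbol{\beta} + u_{it}, \qquad t = 1, 2, \ldots, T, \tag{11.69}$$


where $\mathbf{w}_{it}$ is $1 \times J$, $\mathbf{a}_i$ is $J \times 1$, $\mathbf{x}_{it}$ is $1 \times K$, and $\boldsymbol{\beta}$ is $K \times 1$. The standard unobserved
effects model is a special case with $\mathbf{w}_{it} \equiv 1$; the random trend model is a special case
with $\mathbf{w}_{it} = \mathbf{w}_t = (1, t)$.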

Equation (11.69) allows some time-constant unobserved heterogeneity, contained
in the vector $\mathbf{a}_i$, to interact with some of the observable explanatory variables. For
example, suppose that $prog_{it}$ is a program participation indicator and $y_{it}$ is an
outcome variable. The model


$$y_{it} = a_{i1} + a_{i2} \cdot prog_{it} + \mathbf{x}_{it}\boldsymbol{\beta} + u_{it}$$


allows the effect of the program to depend on the unobserved effect $a_{i2}$ (which may or
may not be tied to $a_{i1}$). While we are interested in estimating $\boldsymbol{\beta}$, we are also interested
in the average effect of the program, $\alpha_2 = E(a_{i2})$. We cannot hope to get good
estimators of the $a_{i2}$ in the usual case of small $T$. Polachek and Kim (1994) study such
models, where the return to experience is allowed to be person-specific. Lemieux 
(1998) estimates a model where unobserved heterogeneity is rewarded differently in 
the union and nonunion sectors. 

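In the general model, we initially focus on estimating $\boldsymbol{\beta}$ and then turn to estimation
of $\boldsymbol{\alpha} = E(\mathbf{a}_i)$, which is the vector of average partial effects for the covariates $\mathbf{w}_{it}$. The
strict exogeneity assumption is the natural extension of assumption (11.65):


ASSUMPTION FE.1′: $E(u_{it} \mid \mathbf{w}_i, \mathbf{x}_i, \mathbf{a}_i) = 0$, $t = 1, 2, \ldots, T$.

Along with equation (11.69), Assumption FE.1′ is equivalent to

$$E(y_{it} \mid \mathbf{w}_{i1}, \ldots, \mathbf{w}_{iT}, \mathbf{x}_{i1}, \ldots, \mathbf{x}_{iT}, \mathbf{a}_i) = E(y_{it} \mid \mathbf{w}_{it}, \mathbf{x}_{it}, \mathbf{a}_i) = \mathbf{w}_{it}\mathbf{a}_i + \mathbf{x}_{it}\boldsymbol{\beta},$$


which says that, once $\mathbf{w}_{it}$, $\mathbf{x}_{it}$, and $\mathbf{a}_i$ have been controlled for, $(\mathbf{w}_{is}, \mathbf{x}_{is})$ for $s \neq t$ do
not help to explain $y_{it}$.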

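Define $\mathbf{W}_i$ as the $T \times J$ matrix with $t$th row $\mathbf{w}_{it}$, and similarly for the $T \times K$ matrix
$\mathbf{X}_i$. Then equation (11.69) can be written as


$$\mathbf{y}_i = \mathbf{W}_i\mathbf{a}_i + \mathbf{X}_i\boldsymbol{\beta} + \mathbf{u}_i. \tag{11.70}$$

Assuming that $\mathbf{W}_i'\mathbf{W}_i$ is nonsingular (technically, with probability one), define

$$\mathbf{M}_i \equiv \mathbf{I}_T - \mathbf{W}_i(\mathbf{W}_i'\mathbf{W}_i)^{-1}\mathbf{W}_i', \tag{11.71}$$


the projection matrix onto the null space of $\mathbf{W}_i$ [the matrix $\mathbf{W}_i(\mathbf{W}_i'\mathbf{W}_i)^{-1}\mathbf{W}_i'$ is the
projection matrix onto the column space of $\mathbf{W}_i$]. In other words, for each cross
section observation $i$, $\mathbf{M}_i\mathbf{y}_i$ is the $T \times 1$ vector of residuals from the time series
regression


$$y_{it} \text{ on } \mathbf{w}_{it}, \qquad t = 1, 2, \ldots, T. \tag{11.72}$$


In the basic FE case, regression (11.72) is the regression $y_{it}$ on 1, $t = 1, 2, \ldots, T$,
and the residuals are simply the time-demeaned variables. In the random trend
case, the regression is $y_{it}$ on 1, $t$, $t = 1, 2, \ldots, T$, which linearly detrends $y_{it}$ for each $i$.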
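The properties of $\mathbf{M}_i$ used in what follows (it annihilates $\mathbf{W}_i$, is idempotent, and has trace $T - J$) can be verified numerically. The sketch below does so for the random trend case $\mathbf{w}_t = (1, t)$ and also confirms that $\mathbf{M}_i\mathbf{y}_i$ matches the residuals from the time series regression (11.72); the helper name and the sample values of $y$ are made up for illustration.

```python
import numpy as np


def annihilator(W):
    """M = I_T - W (W'W)^{-1} W' for a (T, J) matrix W with full column rank."""
    T = W.shape[0]
    return np.eye(T) - W @ np.linalg.solve(W.T @ W, W.T)


T = 5
W_trend = np.column_stack([np.ones(T), np.arange(1, T + 1)])   # w_t = (1, t)
M = annihilator(W_trend)

y_i = np.array([1.0, 2.5, 2.0, 4.0, 6.5])
coef = np.linalg.lstsq(W_trend, y_i, rcond=None)[0]            # regression (11.72)
resid = y_i - W_trend @ coef                                   # detrended y_i

print(np.allclose(M @ W_trend, 0.0))      # M_i W_i = 0
print(np.isclose(np.trace(M), T - 2))     # tr(M_i) = T - J
print(np.allclose(M @ y_i, resid))        # M_i y_i equals the residuals
```

Setting `W` to a single column of ones reproduces ordinary time-demeaning, the basic FE case.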
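The $T \times K$ matrix $\mathbf{M}_i\mathbf{X}_i$ contains as its rows the $1 \times K$ vectors of residuals from
the regression $\mathbf{x}_{it}$ on $\mathbf{w}_{it}$, $t = 1, 2, \ldots, T$. The usefulness of premultiplying by $\mathbf{M}_i$ is
that it allows us to eliminate the unobserved effect $\mathbf{a}_i$ by premultiplying equation
(11.70) through by $\mathbf{M}_i$ and noting that $\mathbf{M}_i\mathbf{W}_i = \mathbf{0}$:


$$\ddot{\mathbf{y}}_i = \ddot{\mathbf{X}}_i\boldsymbol{\beta} + \ddot{\mathbf{u}}_i, \tag{11.73}$$


where $\ddot{\mathbf{y}}_i = \mathbf{M}_i\mathbf{y}_i$, $\ddot{\mathbf{X}}_i = \mathbf{M}_i\mathbf{X}_i$, and $\ddot{\mathbf{u}}_i = \mathbf{M}_i\mathbf{u}_i$. This is an extension of the within
transformation used in basic FE estimation.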

To consistently estimate $\boldsymbol{\beta}$ by system OLS on equation (11.73), we make the
following assumption:


ASSUMPTION FE.2′: $\mathrm{rank}\, E(\ddot{\mathbf{X}}_i'\ddot{\mathbf{X}}_i) = K$, where $\ddot{\mathbf{X}}_i = \mathbf{M}_i\mathbf{X}_i$.


The rank of $\mathbf{M}_i$ is $T - J$, so a necessary condition for Assumption FE.2′ is $J < T$. In
other words, we must have at least one more time period than the number of
elements in $\mathbf{a}_i$. In the basic unobserved effects model, $J = 1$, and we know that $T \geq 2$ is
needed. In the random trend model, $J = 2$, and we need $T \geq 3$ to estimate $\boldsymbol{\beta}$.

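The system OLS estimator of equation (11.73) is


$$\hat{\boldsymbol{\beta}}_{FE} = \left(\sum_{i=1}^{N}\ddot{\mathbf{X}}_i'\ddot{\mathbf{X}}_i\right)^{-1}\left(\sum_{i=1}^{N}\ddot{\mathbf{X}}_i'\ddot{\mathbf{y}}_i\right) = \boldsymbol{\beta} + \left(N^{-1}\sum_{i=1}^{N}\ddot{\mathbf{X}}_i'\ddot{\mathbf{X}}_i\right)^{-1}\left(N^{-1}\sum_{i=1}^{N}\ddot{\mathbf{X}}_i'\mathbf{u}_i\right).$$


Under Assumption FE.1′, $E(\ddot{\mathbf{X}}_i'\mathbf{u}_i) = \mathbf{0}$, and under Assumption FE.2′, $\mathrm{rank}\, E(\ddot{\mathbf{X}}_i'\ddot{\mathbf{X}}_i) = K$,
and so the usual consistency argument goes through. Generally, it is possible
that for some observations, $\ddot{\mathbf{X}}_i'\ddot{\mathbf{X}}_i$ has rank less than $K$. For example, this result occurs
in the standard FE case when $\mathbf{x}_{it}$ does not vary over time for unit $i$. However, under
Assumption FE.2′, $\hat{\boldsymbol{\beta}}_{FE}$ should be well defined unless our cross section sample size is
small or we are unlucky in obtaining the sample.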
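The estimator amounts to residualizing $\mathbf{y}_i$ and $\mathbf{X}_i$ on $\mathbf{W}_i$ unit by unit and then pooling. The following is a minimal sketch under assumed array shapes (`y` is $N \times T$, `X` is $N \times T \times K$, `W` is $N \times T \times J$); the simulated design, in which $x_{it}$ depends on $\mathbf{a}_i$ and on $\mathbf{w}_{it}$, is hypothetical and chosen only so that the heterogeneity matters.

```python
import numpy as np


def fe_individual_slopes(y, X, W):
    """System OLS on the M_i-transformed equation (11.73)."""
    N, T, K = X.shape
    A = np.zeros((K, K))
    b = np.zeros(K)
    for i in range(N):
        Wi = W[i]
        M = np.eye(T) - Wi @ np.linalg.solve(Wi.T @ Wi, Wi.T)
        Xdd = M @ X[i]                    # residuals of x_it on w_it
        A += Xdd.T @ Xdd
        b += Xdd.T @ (M @ y[i])
    return np.linalg.solve(A, b)


def simulate_panel(n=2_000, T=6, beta=1.0, seed=0):
    """Hypothetical design with J = 2, K = 1 and E(a_i) = (1, 2)."""
    rng = np.random.default_rng(seed)
    a = np.column_stack([1.0 + rng.normal(size=n), 2.0 + rng.normal(size=n)])
    w2 = rng.normal(size=(n, T))
    W = np.stack([np.ones((n, T)), w2], axis=2)               # w_it = (1, w2_it)
    x = 0.5 * a[:, [0]] + 0.5 * a[:, [1]] * w2 + rng.normal(size=(n, T))
    u = rng.normal(size=(n, T))
    y = np.einsum("ntj,nj->nt", W, a) + beta * x + u
    return y, x[:, :, None], W


y, X, W = simulate_panel()
print(fe_individual_slopes(y, X, W))      # should be close to the true beta = 1
```

Looping over units keeps the code close to the summation in the formula; for large $N$ a vectorized version would be faster but harder to read against the text.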

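Naturally, the FE estimator is $\sqrt{N}$-asymptotically normally distributed. To obtain
the simplest expression for its asymptotic variance, we add the assumptions of
constant conditional variance and no (conditional) serial correlation on the idiosyncratic
errors $\{u_{it}: t = 1, 2, \ldots, T\}$.


ASSUMPTION FE.3′: $E(\mathbf{u}_i\mathbf{u}_i' \mid \mathbf{w}_i, \mathbf{x}_i, \mathbf{a}_i) = \sigma_u^2\mathbf{I}_T$.

Under Assumption FE.3′, iterated expectations implies


$$E(\ddot{\mathbf{X}}_i'\mathbf{u}_i\mathbf{u}_i'\ddot{\mathbf{X}}_i) = E[\ddot{\mathbf{X}}_i'E(\mathbf{u}_i\mathbf{u}_i' \mid \mathbf{W}_i, \mathbf{X}_i)\ddot{\mathbf{X}}_i] = \sigma_u^2E(\ddot{\mathbf{X}}_i'\ddot{\mathbf{X}}_i).$$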
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Using essentially the same argument as in Section 10.5.2, under Assumptions FE.1’, 
FE.2’, and FE.3’, Avar VN(Îrg — P) = o2[E(X/X;)]| |, and so Avar(Brr) is 
consistently estimated by 


-1 
N 
Avat(Brr) = 6? (> xix) , (11.74) 
i=l 

where ô? is a consistent estimator for a2. As with the standard FE analysis, we must 
use some care in obtaining 62. We have 


T 
XC E(ü}) = E(ü;ü;) = E[E(u;M;u; | W;, X;)] = E{tr[E(u;u;M; | W;, X;)]} 
=l 

= E{tr[E(wu; |W;, X;)MiJ} = Eftr(o2M,)] = (T — J)o? (11.75) 
since tr(M;) = T— J. Let it = Yy —XiuBpe. Then equation (11.75) and standard 
arguments imply that an unbiased and consistent estimator of a? is 


$$\hat{\sigma}_u^2 = [N(T - J) - K]^{-1}\sum_{i=1}^{N}\sum_{t=1}^{T} \hat{u}_{it}^2 = \text{SSR}/[N(T - J) - K]. \tag{11.76}$$

The SSR in equation (11.76) is from the pooled regression

$$\ddot{y}_{it} \text{ on } \ddot{x}_{it}, \quad t = 1, 2, \ldots, T; \; i = 1, 2, \ldots, N, \tag{11.77}$$


which can be used to obtain $\hat{\beta}_{FE}$. Division of the SSR from regression (11.77) by
$N(T - J) - K$ produces $\hat{\sigma}_u^2$. The standard errors reported from regression (11.77)
will be off because the SSR is only divided by $NT - K$; the adjustment factor is
$\{(NT - K)/[N(T - J) - K]\}^{1/2}$.
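As a small arithmetic illustration of the degrees-of-freedom correction, the sketch below computes the factor by which standard errors reported by the pooled regression (11.77) would be multiplied. The function name `fe_se_adjustment` and the sample dimensions are hypothetical.

```python
import math

def fe_se_adjustment(N, T, J, K):
    """Factor {(NT - K)/[N(T - J) - K]}^{1/2} for pooled-OLS standard errors."""
    df_naive = N * T - K            # divisor used by the pooled regression
    df_correct = N * (T - J) - K    # divisor after removing J effects per unit
    if df_correct <= 0:
        raise ValueError("need N(T - J) > K")
    return math.sqrt(df_naive / df_correct)

# Random trend model (J = 2) with five periods: the correction is substantial
factor = fe_se_adjustment(N=500, T=5, J=2, K=2)
print(factor)
```

The factor always exceeds one, and it grows as $J$ approaches $T$, because each additional unit-specific coefficient uses up $N$ degrees of freedom.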

A standard $F$ statistic for testing hypotheses about $\beta$ is also asymptotically valid.
Let $Q$ be the number of restrictions on $\beta$ under $H_0$, and let SSR$_r$ be the restricted sum
of squared residuals from a regression like regression (11.77) but with the restrictions
on $\beta$ imposed. Let SSR$_{ur}$ be the unrestricted sum of squared residuals. Then

$$F = \frac{(\text{SSR}_r - \text{SSR}_{ur})}{\text{SSR}_{ur}} \cdot \frac{[N(T - J) - K]}{Q} \tag{11.78}$$

can be treated as having an $F$ distribution with $Q$ and $N(T - J) - K$ degrees of
freedom. Unless we add a (conditional) normality assumption on $u_i$, equation (11.78)
does not have an exact $F$ distribution, but it is asymptotically valid because
$Q \cdot F \overset{a}{\sim} \chi_Q^2$.
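The statistic in (11.78) is simple to compute once the two sums of squared residuals are available. The following sketch assumes the SSRs come from restricted and unrestricted versions of regression (11.77); the function name `fe_f_stat` and the numbers are made up for illustration.

```python
def fe_f_stat(ssr_r, ssr_ur, N, T, J, K, Q):
    """F statistic (11.78) with N(T - J) - K denominator degrees of freedom."""
    df = N * (T - J) - K
    return (ssr_r - ssr_ur) / ssr_ur * df / Q

# Hypothetical SSRs: the restrictions raise the SSR by 10 percent
F = fe_f_stat(ssr_r=110.0, ssr_ur=100.0, N=100, T=5, J=1, K=3, Q=2)
wald_like = 2 * F   # Q * F, which is compared with a chi-square(Q) distribution
print(F, wald_like)
```

Using $N(T - J) - K$ rather than $NT - K$ in the numerator of the second ratio is what distinguishes this from the $F$ statistic a pooled OLS routine would report.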

Without Assumption FE.3′, equation (11.74) is no longer valid as the variance
estimator and equation (11.78) is not a valid test statistic. But the robust variance matrix
estimator (10.59) can be used with the new definitions for $\ddot{X}_i$ and $\hat{u}_i$. This step leads
directly to robust Wald statistics for multiple restrictions.

To obtain a consistent estimator of $\alpha = E(a_i)$, premultiply equation (11.72) by
$(W_i'W_i)^{-1}W_i'$ and rearrange to get

$$a_i = (W_i'W_i)^{-1}W_i'(y_i - X_i\beta) - (W_i'W_i)^{-1}W_i'u_i. \tag{11.79}$$


Under Assumption FE.1′, $E(u_i \mid W_i) = 0$, and so the second term in equation (11.79)
has a zero expected value. Therefore, assuming that the expected value exists,

$$\alpha = E[(W_i'W_i)^{-1}W_i'(y_i - X_i\beta)].$$


So a consistent, $\sqrt{N}$-asymptotically normal estimator of $\alpha$ is

$$\hat{\alpha} = N^{-1}\sum_{i=1}^{N} (W_i'W_i)^{-1}W_i'(y_i - X_i\hat{\beta}_{FE}). \tag{11.80}$$


With fixed $T$ we cannot consistently estimate the $a_i$ when they are viewed as
parameters. However, for each $i$, the term in the summand in equation (11.80), call it
$\hat{a}_i$, is an unbiased estimator of $a_i$ under Assumptions FE.1′ and FE.2′. This
conclusion is easy to show: $E(\hat{a}_i \mid W, X) = (W_i'W_i)^{-1}W_i'[E(y_i \mid W, X) - X_iE(\hat{\beta}_{FE} \mid W, X)] =
(W_i'W_i)^{-1}W_i'[W_ia_i + X_i\beta - X_i\beta] = a_i$, where we have used the fact that $E(\hat{\beta}_{FE} \mid W, X)
= \beta$. The estimator $\hat{\alpha}$ simply averages the $\hat{a}_i$ over all cross section observations.

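The asymptotic variance of $\sqrt{N}(\hat{\alpha} - \alpha)$ can be obtained by expanding equation
(11.80) and plugging in $\sqrt{N}(\hat{\beta}_{FE} - \beta) = [E(\ddot{X}_i'\ddot{X}_i)]^{-1}(N^{-1/2}\sum_{i=1}^{N} \ddot{X}_i'u_i) + o_p(1)$. A
consistent estimator of Avar $\sqrt{N}(\hat{\alpha} - \alpha)$ can be shown to be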


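$$N^{-1}\sum_{i=1}^{N} [(\hat{a}_i - \hat{\alpha}) - \hat{C}\hat{A}^{-1}\ddot{X}_i'\hat{u}_i][(\hat{a}_i - \hat{\alpha}) - \hat{C}\hat{A}^{-1}\ddot{X}_i'\hat{u}_i]', \tag{11.81}$$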


where $\hat{a}_i = (W_i'W_i)^{-1}W_i'(y_i - X_i\hat{\beta}_{FE})$, $\hat{C} = N^{-1}\sum_{i=1}^{N} (W_i'W_i)^{-1}W_i'X_i$, $\hat{A} =
N^{-1}\sum_{i=1}^{N} \ddot{X}_i'\ddot{X}_i$, and $\hat{u}_i = \ddot{y}_i - \ddot{X}_i\hat{\beta}_{FE}$. This estimator is fully robust in the sense that it
does not rely on Assumption FE.3′. As usual, asymptotic standard errors of the
elements of $\hat{\alpha}$ are obtained by dividing expression (11.81) by $N$ and taking the square
roots of the diagonal elements. As special cases, expression (11.81) can be applied to
the traditional unobserved effects and random trend models.
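A minimal numerical sketch of (11.80) and (11.81) follows. It assumes a scalar regressor ($K = 1$), a common $W_i = W$ with rows $(1, t)$, and a simulated design; all names and parameter values are illustrative rather than taken from the text.

```python
import numpy as np

rng = np.random.default_rng(1)
N, T = 2000, 4
t = np.arange(1, T + 1)
Wt = np.column_stack([np.ones(T), t])           # W, T x J with J = 2
alpha = np.array([2.0, 0.3])                    # alpha = E(a_i)
a = alpha + rng.normal(scale=0.5, size=(N, 2))
x = rng.normal(size=(N, T)) + a[:, [0]]         # regressor correlated with the level effect
beta = 1.5
y = a @ Wt.T + beta * x + rng.normal(size=(N, T))

P = np.linalg.solve(Wt.T @ Wt, Wt.T)            # (W'W)^{-1} W', J x T
M = np.eye(T) - Wt @ P                          # symmetric, so x @ M residualizes each row
xdd, ydd = x @ M, y @ M
beta_fe = (xdd * ydd).sum() / (xdd ** 2).sum()

a_hat = (y - beta_fe * x) @ P.T                 # row i is a_hat_i
alpha_hat = a_hat.mean(axis=0)                  # equation (11.80)

# Robust variance estimator (11.81), then divide by N for Avar(alpha_hat)
C_hat = P @ x.mean(axis=0)                      # J-vector because K = 1
A_hat = (xdd ** 2).sum() / N
u_hat = ydd - beta_fe * xdd
score = (xdd * u_hat).sum(axis=1)               # X_i'u_i for each unit
term = (a_hat - alpha_hat) - np.outer(score, C_hat) / A_hat
V = term.T @ term / N
se = np.sqrt(np.diag(V) / N)
print(alpha_hat, se)
```

The second term inside `term` accounts for the sampling error in $\hat{\beta}_{FE}$; dropping it would generally understate the standard error of the level element when the regressor has a nonzero mean.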

The estimator $\hat{\alpha}$ in equation (11.80) is not necessarily the most efficient. A better
approach is to use the moment conditions for $\hat{\beta}_{FE}$ and $\hat{\alpha}$ simultaneously. This leads
to nonlinear instrumental variables methods, something we take up in Chapter 14.
Chamberlain (1992a) covers the efficient method of moments approach to estimating
$\alpha$ and $\beta$; see also Lemieux (1998).

11.7.3 Robustness of Standard Fixed Effects Methods


In the previous two sections, we assumed that a set of slope coefficients, $\beta$, did not
vary across units, and that we know the set of slope coefficients, generally $a_i$ in (11.69),
that might vary across $i$. Even in this situation, the allowable dimension of the
unobserved heterogeneity is restricted by the number of time periods, $T$. In this
subsection we study what happens if we mistakenly treat some random slopes as if they
are fixed and apply standard FE methods. We might ignore some heterogeneity
because we are ignorant of the scope of heterogeneity in the model or because we
simply do not have enough time periods to proceed with a general analysis.
We begin with an extension of the usual model to allow for unit-specific slopes, 


$$y_{it} = c_i + x_{it}b_i + u_{it} \tag{11.82}$$

$$E(u_{it} \mid x_i, c_i, b_i) = 0, \quad t = 1, \ldots, T, \tag{11.83}$$


where $b_i$ is $K \times 1$. However, unlike in Section 11.6.2, we now ignore the
heterogeneity in the slopes and act as if $b_i$ is constant for all $i$. We think $c_i$ might be correlated with
at least some elements of $x_{it}$, and therefore we apply the usual FE estimator. The
question we address here is, when does the usual FE estimator consistently estimate
the population average effect, $\beta = E(b_i)$?

In addition to assumption (11.83), we naturally need the usual FE rank condition,
Assumption FE.2. But what else is needed for FE to consistently estimate $\beta$? It helps
to write $b_i = \beta + d_i$ where the unit-specific deviation from the average, $d_i$, necessarily
has a zero mean. Then


$$y_{it} = c_i + x_{it}\beta + x_{it}d_i + u_{it} \equiv c_i + x_{it}\beta + v_{it}, \tag{11.84}$$


where $v_{it} = x_{it}d_i + u_{it}$. As we saw in Section 10.5.6, a sufficient condition for
consistency of the FE estimator (along with Assumption FE.2) is

$$E(\ddot{x}_{it}'\ddot{v}_{it}) = 0, \quad t = 1, \ldots, T. \tag{11.85}$$


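But $\ddot{v}_{it} = \ddot{x}_{it}d_i + \ddot{u}_{it}$ and $E(\ddot{x}_{it}'\ddot{u}_{it}) = 0$ by (11.83). Therefore, the extra assumption that
ensures (11.85) is $E(\ddot{x}_{it}'\ddot{x}_{it}d_i) = 0$ for all $t$. A sufficient condition, and one that is easier
to interpret, is

$$E(b_i \mid \ddot{x}_{it}) = E(b_i) = \beta, \quad t = 1, \ldots, T. \tag{11.86}$$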


Importantly, condition (11.86) allows the slopes, $b_i$, to be correlated with the
regressors $x_{it}$ through permanent components. What it rules out is correlation between
$b_i$ and idiosyncratic movements in $x_{it}$. We can formalize this statement by writing
$x_{it} = f_i + r_{it}$, $t = 1, \ldots, T$. Then (11.86) holds if $E(b_i \mid r_{i1}, r_{i2}, \ldots, r_{iT}) = E(b_i)$. So $b_i$ is
allowed to be arbitrarily correlated with the permanent component, $f_i$. (Of course,
$x_{it} = f_i + r_{it}$ is a special representation of the covariates, but it helps to illustrate
condition (11.86).)
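A small simulation can illustrate condition (11.86). The sketch below is an assumption-laden example, not taken from the text: a scalar regressor $x_{it} = f_i + r_{it}$ with normal draws, and two hypothetical slope designs. In the first, $b_i$ depends only on the permanent component $f_i$; in the second, $b_i$ depends on the within-unit variability of $x_{it}$, which violates $E(\ddot{x}_{it}'\ddot{x}_{it}d_i) = 0$.

```python
import numpy as np

rng = np.random.default_rng(2)
N, T, beta = 5000, 4, 1.0
f = rng.normal(size=N)                 # permanent component of x
r = rng.normal(size=(N, T))            # idiosyncratic movements
x = f[:, None] + r
c = f + rng.normal(size=N)             # unobserved effect, correlated with x
u = rng.normal(size=(N, T))

def within(z):
    """Time-demean each unit's data."""
    return z - z.mean(axis=1, keepdims=True)

def fe_slope(b):
    """Usual FE estimate when the true unit-specific slopes are b."""
    y = c[:, None] + b[:, None] * x + u
    xdd, ydd = within(x), within(y)
    return (xdd * ydd).sum() / (xdd ** 2).sum()

# Case 1: slope correlated with f_i only, so (11.86) holds
b_ok = beta + 0.8 * f + rng.normal(scale=0.3, size=N)
# Case 2: slope depends on the within-unit sample variance of x, so (11.86) fails
s = (within(x) ** 2).sum(axis=1) / (T - 1)
b_bad = beta + 0.5 * (s - 1.0)         # still has mean beta

beta_fe_ok = fe_slope(b_ok)
beta_fe_bad = fe_slope(b_bad)
print(beta_fe_ok, beta_fe_bad)
```

In the first case the FE estimate should be close to $E(b_i) = 1$ even though $b_i$ and $x_{it}$ are strongly correlated. In the second case the estimate should sit noticeably above 1 (roughly one third above, under this particular design), although $E(b_i)$ is unchanged.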

Wooldridge (2005a) studies a more general class of estimators that includes the
usual FE and random trend estimator. Write

$$y_{it} = w_ta_i + x_{it}b_i + u_{it}, \quad t = 1, \ldots, T, \tag{11.87}$$


where $w_t$ is a set of deterministic functions of time. We maintain the standard
assumption (11.83) but with $a_i$ in place of $c_i$. Now, the “fixed effects” estimator sweeps
away $a_i$ by netting out $w_t$ from $x_{it}$. In particular, now let $\ddot{x}_{it}$ denote the residuals from
the regression $x_{it}$ on $w_t$, $t = 1, \ldots, T$.

In the random trend model, $w_t = (1, t)$, and so the elements of $x_{it}$ have unit-specific
linear trends removed in addition to a level effect. Removing even more of the
heterogeneity from $\{x_{it}\}$ makes it even more likely that (11.86) holds. For example, if
$x_{it} = f_i + h_it + r_{it}$, then $b_i$ can be arbitrarily correlated with $(f_i, h_i)$. Of course,
individually detrending the $x_{it}$ requires at least three time periods, and it decreases the
variation in $\ddot{x}_{it}$, compared to the usual FE estimator. Not surprisingly, increasing the
dimension of $w_t$ (subject to the restriction dim$(w_t) < T$) generally leads to less
precision of the estimator. See Wooldridge (2005a) for further discussion.
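The loss of variation from netting out more functions of time can be seen in a short sketch. It assumes, purely for illustration, a regressor with no permanent component at all (i.i.d. standard normal over $i$ and $t$) and polynomial time functions $w_t = (1, t, \ldots, t^{J-1})$; the helper name `residual_share` is hypothetical.

```python
import numpy as np

rng = np.random.default_rng(3)
N, T = 4000, 6
t = np.arange(1, T + 1)
x = rng.normal(size=(N, T))            # pure idiosyncratic variation, variance 1

def residual_share(J):
    """Share of the variation in x left after unit-specific regression on 1, t, ..., t^(J-1)."""
    W = np.vander(t, J, increasing=True).astype(float)   # T x J
    M = np.eye(T) - W @ np.linalg.solve(W.T @ W, W.T)
    xdd = x @ M                        # M is symmetric, so this residualizes each row
    return (xdd ** 2).mean() / (x ** 2).mean()

shares = [residual_share(J) for J in (1, 2, 3)]
print(shares)
```

In this design the remaining share should be close to $(T - J)/T$, so moving from demeaning ($J = 1$) to detrending ($J = 2$) with $T = 6$ removes about another sixth of the usable variation.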

Of course, the FD transformation can be used in place of, or in conjunction with, 
unit-specific detrending; see Section 11.6.1 for the random growth model. For
example, if we use the FD transformation followed by the within transformation, it is
easily seen that a condition sufficient for consistency of the resulting estimator for $\beta$ is

$$E(b_i \mid \Delta\ddot{x}_{it}) = E(b_i), \quad t = 2, \ldots, T, \tag{11.88}$$

where $\Delta\ddot{x}_{it} = \Delta x_{it} - \overline{\Delta x}_i$ are the demeaned first differences.

The results for the FE estimator (in the generalized sense of removing unit-specific 
means and possibly trends) extend to fixed effects IV methods, provided we add a 
constant conditional covariance assumption of the type introduced in Section 6.4.1, 
but extended to the panel data case. Murtazashvili and Wooldridge (2008) derive a 
simple set of sufficient conditions. In the model with general trends, we assume the 
natural extension of Assumption FEIV.1, that is, $E(u_{it} \mid z_i, a_i, b_i) = 0$ for all $t$, along
with Assumption FEIV.2. We modify assumption (11.86) in the obvious way: replace
$\ddot{x}_{it}$ with $\ddot{z}_{it}$, the individual-specific detrended instruments:

$$E(b_i \mid \ddot{z}_{it}) = E(b_i) = \beta, \quad t = 1, \ldots, T. \tag{11.89}$$
But something more is needed. Murtazashvili and Wooldridge (2008) show that,
along with the previous assumptions, a sufficient condition is

$$\text{Cov}(\ddot{x}_{it}, b_i \mid \ddot{z}_{it}) = \text{Cov}(\ddot{x}_{it}, b_i), \quad t = 1, \ldots, T. \tag{11.90}$$


Note that the covariance Cov$(\ddot{x}_{it}, b_i)$, a $K \times K$ matrix, need not be zero, or even
constant across time. In other words, unlike condition (11.86), we can allow the
detrended covariates to be correlated with the heterogeneous slopes. But the
conditional covariance cannot depend on the time-demeaned instruments.

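We can easily show why (11.90) suffices with the previous assumptions. First,
if $E(d_i \mid \ddot{z}_{it}) = 0$, which follows from $E(b_i \mid \ddot{z}_{it}) = E(b_i)$, then Cov$(\ddot{x}_{it}, d_i \mid \ddot{z}_{it}) =
E(\ddot{x}_{it}d_i \mid \ddot{z}_{it})$, and so $E(\ddot{x}_{it}d_i \mid \ddot{z}_{it}) = E(\ddot{x}_{it}d_i) \equiv \gamma_t$ under the previous assumptions.
Write $\ddot{x}_{it}d_i = \gamma_t + r_{it}$ where $E(r_{it} \mid \ddot{z}_{it}) = 0$, $t = 1, \ldots, T$. Then we can write the
transformed equation as

$$\ddot{y}_{it} = \ddot{x}_{it}\beta + \ddot{x}_{it}d_i + \ddot{u}_{it} = \ddot{x}_{it}\beta + \gamma_t + r_{it} + \ddot{u}_{it}. \tag{11.91}$$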


Now, if $x_{it}$ contains a full set of time period dummies, then we can absorb $\gamma_t$ into $\ddot{x}_{it}$,
and we assume that here. Then the sufficient condition for consistency of IV
estimators applied to the transformed equations is $E[\ddot{z}_{it}'(r_{it} + \ddot{u}_{it})] = 0$, and this condition is
met under the maintained assumptions. In other words, under (11.89) and (11.90),
the FE 2SLS estimator is consistent for the average population effect, $\beta$. (Remember,
we use “fixed effects” here in the general sense of eliminating the unit-specific trends,
$a_i$.) We must remember to include a full set of time period dummies if we want to
apply this robustness result, something that should be done in any case. Naturally,
we can also use GMM to obtain a more efficient estimator. If $b_i$ truly depends on $i$,
then the composite error $r_{it} + \ddot{u}_{it}$ is likely serially correlated and heteroskedastic. See
Murtazashvili and Wooldridge (2008) for further discussion and simulation results on
the performance of the FE2SLS estimator. They also provide examples where the key
assumptions cannot be expected to hold, such as when endogenous elements of $x_{it}$ are
discrete.


11.7.4 Testing for Correlated Random Slopes 


The findings of the previous subsection suggest that standard panel data methods 
that remove unit-specific heterogeneity have satisfying robustness properties for esti- 
mating the population average effects. Nevertheless, in some cases we want to know 
whether there is evidence for heterogeneous slopes. 

We focus on the model (11.82), first under assumption (11.83). We could consider
cases where $c_i$ is assumed to be uncorrelated with $x_i$, but, in keeping with the spirit of
the previous subsection, we allow $c_i$ to be correlated with $x_i$. Then, allowing for the
presence of $c_i$, the goal is effectively to test Var$(b_i)$ = Var$(d_i) = 0$. Unfortunately,
without additional assumptions, it is not possible to test Var$(d_i) = 0$, even if we are
very specific about the alternative. Suppose we specify as the alternative


$$\text{Var}(b_i \mid x_i) = \text{Var}(b_i) \equiv \Lambda, \tag{11.92}$$


so that the conditional variance is not a function of $x_i$. (We also assume $E(b_i \mid x_i) =
E(b_i)$ under the alternative.) Unfortunately, even assumption (11.92) is not enough to
proceed. We need to restrict the conditional variance matrix of the idiosyncratic
errors, and the simplest (and most common) assumption is

$$\text{Var}(u_i \mid x_i, c_i, b_i) = \sigma_u^2I_T, \tag{11.93}$$


which is the natural extension of Assumption FE.3 from Chapter 10. Along with
(11.83), these assumptions allow us to test Var$(b_i) = 0$. To see why, write the
time-demeaned equation as $\ddot{y}_i = \ddot{X}_i\beta + \ddot{v}_i$, where

$$E(\ddot{v}_i \mid x_i) = \ddot{X}_iE(d_i \mid x_i) + E(\ddot{u}_i \mid x_i) = 0,$$
$$\text{Var}(\ddot{v}_i \mid x_i) = \ddot{X}_i\Lambda\ddot{X}_i' + \sigma_u^2M_T,$$

and $M_T = I_T - j_T(j_T'j_T)^{-1}j_T'$. The last equation shows that, under the maintained
assumptions, Var$(\ddot{v}_i \mid x_i)$ does not depend on $x_i$ if $\Lambda = 0$. If $\Lambda \ne 0$, then the
composite error in the time-demeaned equation generally exhibits heteroskedasticity and
serial correlation that are quadratic functions of the time-demeaned regressors. So,
the method would be to estimate $\beta$ by standard FE methods, obtain the FE residuals,
and then test whether the variance matrix is a quadratic function of the $\ddot{x}_{it}$.

The main problem with the previous test is that it associates system
heteroskedasticity (that is, variances and covariances depending on the regressors) with
the presence of “random” coefficients. But if $b_i = \beta$ and Var$(u_i \mid x_i, c_i)$ depends on $x_i$,
Var$(\ddot{v}_i \mid x_i)$ generally depends on $x_i$. In other words, there is no convincing way to
distinguish system heteroskedasticity in Var$(u_i \mid x_i, c_i)$ from nonconstant $b_i$.

Rather than try to test whether Var$(b_i) \ne 0$, we can instead test whether $b_i$ varies
with observable variables; that is, we can test

$$H_0: E(b_i \mid x_i) = E(b_i), \tag{11.94}$$

when the covariates satisfy the strict exogeneity assumption (11.83). A sensible
alternative is that $E(b_i \mid x_i)$ depends on the time averages, something we can capture
with

$$b_i = \alpha + \Gamma\bar{x}_i' + d_i, \tag{11.95}$$


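where $\alpha$ is $K \times 1$ and $\Gamma$ is $K \times K$. Under the null hypothesis, $\Gamma = 0$, and then $\alpha = \beta$.
Explicitly allowing for aggregate time effects gives

$$y_{it} = \theta_t + c_i + x_{it}\alpha + x_{it}\Gamma\bar{x}_i' + x_{it}d_i + u_{it}$$

$$= \theta_t + c_i + x_{it}\alpha + (\bar{x}_i \otimes x_{it})\,\text{vec}(\Gamma) + x_{it}d_i + u_{it}$$

$$\equiv \theta_t + c_i + x_{it}\alpha + (\bar{x}_i \otimes x_{it})\gamma + v_{it}, \tag{11.96}$$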


where $v_{it} = x_{it}d_i + u_{it}$ and $\gamma = \text{vec}(\Gamma)$. The test of $H_0: \gamma = 0$ is simple to carry out. It
amounts to interacting the time averages with the elements of $x_{it}$ (or, one can choose
a subset of $\bar{x}_i$ to interact with a subset of $x_{it}$) and obtaining a fully robust test of joint
significance in the context of FE estimation. A failure to reject means that if the $b_i$
vary by $i$, they apparently do not do so in a way that depends on the time averages of
the covariates. The weakness of this test is that it cannot detect heterogeneity in $b_i$
that is uncorrelated with $\bar{x}_i$. (Like the previous test, this test is not intended to
determine whether FE is consistent for $\beta = E(b_i)$.)
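The test can be sketched with a scalar regressor. The code below is an illustration under assumptions that are not in the text: a balanced simulated panel, time effects removed by two-way demeaning, and a hand-rolled variance estimator clustered by unit. The function name `fe_cluster` and all parameter values are hypothetical.

```python
import numpy as np

def fe_cluster(y, X):
    """FE estimates with time effects and standard errors clustered by unit.

    y: (N, T), X: (N, T, K), balanced panel.
    """
    def demean(z):
        z = z - z.mean(axis=1, keepdims=True)       # remove unit means
        return z - z.mean(axis=0, keepdims=True)    # remove period means
    yd, Xd = demean(y), demean(X)
    A = np.einsum('itk,itl->kl', Xd, Xd)
    coef = np.linalg.solve(A, np.einsum('itk,it->k', Xd, yd))
    u = yd - Xd @ coef
    s = np.einsum('itk,it->ik', Xd, u)              # X_i'u_i for each unit
    Ainv = np.linalg.inv(A)
    V = Ainv @ (s.T @ s) @ Ainv                     # robust to serial correlation and heteroskedasticity
    return coef, np.sqrt(np.diag(V))

# Simulated data in which the slope depends on the time average of x
rng = np.random.default_rng(4)
N, T = 2000, 4
f = rng.normal(size=N)
x = f[:, None] + rng.normal(size=(N, T))
xbar = x.mean(axis=1)
b = 1.0 + 0.5 * xbar + rng.normal(scale=0.3, size=N)   # alpha = 1, Gamma = 0.5
theta = np.array([0.0, 0.2, 0.4, 0.6])                  # aggregate time effects
c = f + rng.normal(size=N)
y = theta + c[:, None] + b[:, None] * x + rng.normal(size=(N, T))

X = np.stack([x, xbar[:, None] * x], axis=2)            # x_it and the interaction xbar_i * x_it
coef, se = fe_cluster(y, X)
t_gamma = coef[1] / se[1]
print(coef, se, t_gamma)
```

With several regressors, the same variance matrix would be used to form a robust Wald statistic for the joint significance of all interaction terms; the single $t$ statistic here is the scalar special case.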

We can easily allow some elements of $x_{it}$ to be endogenous, in which case we write
the alternative as $b_i = \alpha + \Gamma\bar{z}_i' + d_i$. Then, under the assumptions in the previous
subsection, we estimate

$$y_{it} = \theta_t + c_i + x_{it}\alpha + (\bar{z}_i \otimes x_{it})\gamma + v_{it} \tag{11.97}$$


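by FEIV using instruments $(z_{it}, \bar{z}_i \otimes z_{it})$, which is $1 \times (L + L^2)$. FE2SLS estimation is
equivalent to P2SLS estimation of the equation

$$\ddot{y}_{it} = \eta_t + \ddot{x}_{it}\alpha + (\bar{z}_i \otimes \ddot{x}_{it})\gamma + \ddot{v}_{it} \tag{11.98}$$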


using instruments $(\ddot{z}_{it}, \bar{z}_i \otimes \ddot{z}_{it})$, where the double dot denotes time demeaning, as
usual. This characterization is convenient for obtaining a fully robust test using
software that computes P2SLS estimates with standard errors and test statistics robust to
arbitrary serial correlation and heteroskedasticity. (Recall that the usual standard
errors from (11.98) are not valid even under homoskedasticity and serial
independence because they do not properly account for the lost degrees of freedom due to the
time demeaning.) Again, we can be selective about what we actually include in
$(\bar{z}_i \otimes x_{it})$. For example, perhaps we are interested in one element, say $b_{i1}$, and one
element of $\bar{z}_i$, say $\bar{z}_{i1}$. Then we would simply add the scalar $\bar{z}_{i1}x_{it1}$ to the equation and
compute its $t$ statistic. Generally, if we reject $H_0$, we can entertain (11.97) as an
alternative model, provided we make assumption (11.90). We can explicitly study how
the partial effects of $x_{it}$ depend on $\bar{z}_i$, and also compute the average partial effects.
(Alternatively, we can use $\bar{z}_i - \bar{z}$ in place of $\bar{z}_i$, and then the coefficient
on $x_{it}$ effectively would be the APE, $\beta = E(b_i)$.) We can also include other
time-constant variables that drop out of the FE estimation when they appear by
themselves; we simply interact those variables with elements of $x_{it}$.


Example 11.5 (Testing for Correlated Random Slopes in a Passenger Demand
Equation): Suppose we think the elasticity of passenger demand with respect to airfare
differs by route, and so we specify the equation

$$lpassen_{it} = \theta_t + c_i + b_i\,lfare_{it} + u_{it}$$

and we use $z_{it} = concen_{it}$ as the instrument for $lfare_{it}$, just as in Example 11.1. To test
whether $b_i$ is correlated with $\bar{z}_i$, we use the equation

$$lpassen_{it} = \theta_t + c_i + \alpha\,lfare_{it} + \gamma\,\overline{concen}_i \cdot lfare_{it} + v_{it},$$

where $v_{it} = x_{it}d_i + u_{it}$. We estimate this equation by FEIV using $concen_{it}$ and
$\overline{concen}_i \cdot concen_{it}$ as instruments. As mentioned above, this is equivalent to applying
pooled IV to equation (11.98), and that is how we obtain the fully robust $t$ statistic.
The estimated $\gamma$ is very large in magnitude, $\hat{\gamma} = -11.05$, but its robust $t$ statistic is
only $-.60$. We conclude that there is little evidence in this data set that $b_i$ varies with
the time average of the route concentration.


Problems 


11.1. Let $y_{it}$ denote the unemployment rate for city $i$ at time $t$. You are interested in
studying the effects of a federally funded job training program on city unemployment
rates. Let $z_i$ denote a vector of time-constant city-specific variables that may influence
the unemployment rate (these could include things like geographic location). Let $x_{it}$
be a vector of time-varying factors that can affect the unemployment rate. The
variable $prog_{it}$ is the dummy indicator for program participation: $prog_{it} = 1$ if city $i$
participated at time $t$. Any sequence of program participation is possible, so that a city
may participate in one year but not the next.


a. Discuss the merits of including $y_{i,t-1}$ in the model

$$y_{it} = \theta_t + z_i\gamma + x_{it}\beta + \rho_1y_{i,t-1} + \delta_1prog_{it} + u_{it}, \quad t = 1, 2, \ldots, T.$$


State an assumption that allows you to consistently estimate the parameters by 
POLS. 
b. Evaluate the following statement: “The model in part a is of limited value because
the POLS estimators are inconsistent if the $\{u_{it}\}$ are serially correlated.”


c. Suppose that it is more realistic to assume that program participation depends on 
time-constant, unobservable city heterogeneity, but not directly on past unemploy- 
ment. Write down a model that allows you to estimate the effectiveness of the pro- 
gram in this case. Explain how to estimate the parameters, describing any minimal 
assumptions you need. 


d. Write down a model that allows the features in parts a and c. In other words, 
$prog_{it}$ can depend on unobserved city heterogeneity as well as on past unemployment
history. Explain how to consistently estimate the effect of the program, again stating 
minimal assumptions. 


11.2. Consider the following unobserved components model:

$$y_{it} = z_{it}\gamma + \delta w_{it} + c_i + u_{it}, \quad t = 1, 2, \ldots, T,$$

where $z_{it}$ is a $1 \times K$ vector of time-varying variables (which could include time-period
dummies), $w_{it}$ is a time-varying scalar, $c_i$ is a time-constant unobserved effect, and $u_{it}$
is the idiosyncratic error. The $z_{it}$ are strictly exogenous in the sense that

$$E(z_{is}'u_{it}) = 0, \quad \text{all } s, t = 1, 2, \ldots, T, \tag{11.99}$$

but $c_i$ is allowed to be arbitrarily correlated with each $z_{it}$. The variable $w_{it}$ is
endogenous in the sense that it can be correlated with $u_{it}$ (as well as with $c_i$).


a. Suppose that $T = 2$, and that assumption (11.99) contains the only available
orthogonality conditions. What are the properties of the OLS estimators of $\gamma$ and $\delta$ on
the differenced data? Support your claim (but do not include asymptotic derivations).


b. Under assumption (11.99), still with $T = 2$, write the linear reduced form for the
difference $\Delta w_i$ as $\Delta w_i = z_{i1}\pi_1 + z_{i2}\pi_2 + r_i$, where, by construction, $r_i$ is uncorrelated
with both $z_{i1}$ and $z_{i2}$. What condition on $(\pi_1, \pi_2)$ is needed to identify $\gamma$ and $\delta$? (Hint:
It is useful to rewrite the reduced form of $\Delta w_i$ in terms of $\Delta z_i$ and, say, $z_{i1}$.) How can
you test this condition?


c. Now consider the general $T$ case, where we add to assumption (11.99) the
assumption $E(w_{is}u_{it}) = 0$, $s < t$, so that previous values of $w_{it}$ are uncorrelated with $u_{it}$.
Explain carefully, including equations where appropriate, how you would estimate $\gamma$
and $\delta$.


d. Again consider the general T case, but now use the fixed effects transformation to 
eliminate $c_i$:




$$\ddot{y}_{it} = \ddot{z}_{it}\gamma + \delta\ddot{w}_{it} + \ddot{u}_{it}.$$

What are the properties of the IV estimators if you use $\ddot{z}_{it}$ and $w_{i,t-p}$, $p \ge 1$, as
instruments in estimating this equation by pooled IV? (You can only use time periods
$p + 1, \ldots, T$ after the initial demeaning.)


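11.3. Show that, in the simple model (11.41) with $T > 2$, under the assumptions
(11.42), $E(r_{it} \mid x_i^*, c_i) = 0$ for all $t$, and Var$(r_{it} - \bar{r}_i)$ and Var$(x_{it}^* - \bar{x}_i^*)$ constant across
$t$, the plim of the FE estimator is

$$\underset{N \to \infty}{\text{plim}}\; \hat{\beta}_{FE} = \beta\left\{1 - \frac{\text{Var}(r_{it} - \bar{r}_i)}{[\text{Var}(x_{it}^* - \bar{x}_i^*) + \text{Var}(r_{it} - \bar{r}_i)]}\right\}.$$

Thus, there is attenuation bias in the FE estimator under these assumptions.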


11.4. a. Show that, in the FE model, a consistent estimator of $\mu_c \equiv E(c_i)$ is
$\hat{\mu}_c = N^{-1}\sum_{i=1}^{N} (\bar{y}_i - \bar{x}_i\hat{\beta}_{FE})$.

b. In the random trend model, how would you estimate $\mu_g = E(g_i)$?

11.5. An RE analysis of model (11.69) would add $E(a_i \mid w_i, x_i) = E(a_i) =
\alpha$ to Assumption FE.1′ and, to Assumption FE.3′, Var$(a_i \mid w_i, x_i) = \Lambda$, where $\Lambda$ is a
$J \times J$ positive semidefinite matrix. (This approach allows the elements of $a_i$ to be
arbitrarily correlated.)


a. Define the $T \times 1$ composite error vector $v_i \equiv W_i(a_i - \alpha) + u_i$. Find $E(v_i \mid w_i, x_i)$
and Var$(v_i \mid w_i, x_i)$. Comment on the conditional variance.


b. If you apply the usual RE procedure to the equation

$$y_{it} = w_{it}\alpha + x_{it}\beta + v_{it}, \quad t = 1, 2, \ldots, T,$$


what are the asymptotic properties of the RE estimator and the usual RE standard 
errors and test statistics? 

c. How could you modify your inference from part b to be asymptotically valid? 

11.6. Does the measurement error model in equations (11.45) to (11.49) apply when
$w_{it}$ is a lagged dependent variable? Explain.


11.7. In the Chamberlain model in Section 11.1.2, suppose that $\lambda_t = \lambda/T$ for all $t$.
Show that the POLS coefficient on $x_{it}$ in the regression $y_{it}$ on 1, $x_{it}$, $\bar{x}_i$, $t =
1, \ldots, T$; $i = 1, \ldots, N$, is the FE estimator. (Hint: Use partitioned regression.)


11.8. In model (11.1), first difference to remove $c_i$:

$$\Delta y_{it} = \Delta x_{it}\beta + \Delta u_{it}, \quad t = 2, \ldots, T. \tag{11.100}$$

Assume that a vector of instruments, $z_{it}$, satisfies $E(\Delta u_{it} \mid z_{it}) = 0$, $t = 2, \ldots, T$.
Typically, several elements in $\Delta x_{it}$ would be included in $z_{it}$, provided they are
appropriately exogenous. Of course the elements of $z_{it}$ can be arbitrarily correlated
with $c_i$.

a. State the rank condition that is necessary and sufficient for P2SLS estimation of
equation (11.100) using instruments $z_{it}$ to be consistent (for fixed $T$).

b. Under what additional assumptions are the usual P2SLS standard errors and test 
statistics asymptotically valid? 


c. How would you test for first-order serial correlation in $\Delta u_{it}$?


11.9. Consider model (11.1) under Assumptions FEIV.1 and FEIV.2. 


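a. Show that, under the additional Assumption FEIV.3, the asymptotic variance of
$\sqrt{N}(\hat{\beta} - \beta)$ is $\sigma_u^2\{E(\ddot{X}_i'\ddot{Z}_i)[E(\ddot{Z}_i'\ddot{Z}_i)]^{-1}E(\ddot{Z}_i'\ddot{X}_i)\}^{-1}$.

b. Propose a consistent estimator of $\sigma_u^2$.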


c. Show that the 2SLS estimator of $\beta$ from part a can be obtained by means of a
dummy variable approach: estimate

$$y_{it} = c_1d1_i + \cdots + c_NdN_i + x_{it}\beta + u_{it}$$

by P2SLS, using instruments $(d1_i, d2_i, \ldots, dN_i, z_{it})$. (Hint: Use the obvious extension
of Problem 5.1 to P2SLS, and repeatedly apply the algebra of partial regression.)
This is another case where, even though we cannot estimate the $c_i$ consistently with
fixed $T$, we still get a consistent estimator of $\beta$.


d. In using the 2SLS approach from part c, explain why the usually reported stan- 
dard errors are valid under Assumption FEIV.3. 


e. How would you obtain valid standard errors for 2SLS without Assumption 
FEIV.3? 


11.10. Consider the general model (11.69), where unobserved heterogeneity
interacts with possibly several variables. Show that the FE estimator of $\beta$ is also obtained
by running the regression

$$y_{it} \text{ on } d1_iw_{it}, d2_iw_{it}, \ldots, dN_iw_{it}, x_{it}, \quad t = 1, 2, \ldots, T; \; i = 1, 2, \ldots, N, \tag{11.101}$$

where $dn_i = 1$ if and only if $n = i$. In other words, we interact $w_{it}$ in each time period
with a full set of cross section dummies, and then include all of these terms in a POLS
regression with $x_{it}$. You should also verify that the residuals from regression (11.101)
are identical to those from regression (11.77), and that regression (11.101) yields
equation (11.76) directly. This proof extends the material on the basic dummy
variable regression from Section 10.5.3.


11.11. Apply the random growth model to the data in JTRAIN1.RAW (see Ex- 
ample 10.6): 


$$\log(scrap_{it}) = \theta_t + c_i + g_it + \beta_1grant_{it} + \beta_2grant_{i,t-1} + u_{it}.$$


Specifically, difference once and then either difference again or apply fixed effects to 
the first-differenced equation. Discuss the results. 


11.12. An unobserved effects model explaining current murder rates in terms of the 
number of executions in the last three years is 


$$mrdrte_{it} = \theta_t + \beta_1exec_{it} + \beta_2unem_{it} + c_i + u_{it},$$


where $mrdrte_{it}$ is the number of murders in state $i$ during year $t$, per 10,000 people;
$exec_{it}$ is the total number of executions for the current and prior two years; and
$unem_{it}$ is the current unemployment rate, included as a control.


a. Using the data for 1990 and 1993 in MURDER.RAW, estimate this model by 
first differencing. Notice that you should allow different year intercepts. 


b. Under what circumstances would $exec_{it}$ not be strictly exogenous (conditional on
$c_i$)? Assuming that no further lags of $exec$ appear in the model and that $unem$ is
strictly exogenous, propose a method for consistently estimating $\beta$ when $exec$ is not
strictly exogenous.


c. Apply the method from part b to the data in MURDER.RAW. Be sure to also 
test the rank condition. Do your results differ much from those in part a? 


d. What happens to the estimates from parts a and c if Texas is dropped from the 
analysis? 


11.13. Use the data in PRISON.RAW for this question to estimate equation 
(11.40). 


a. Estimate the reduced form equation for $\Delta\log(prison)$ to ensure that $final1$ and
$final2$ are partially correlated with $\Delta\log(prison)$. The elements of $\Delta x$ should be the
changes in the following variables: $\log(polpc)$, $\log(incpc)$, $unem$, $black$, $metro$,
$ag0\_14$, $ag15\_17$, $ag18\_24$, and $ag25\_34$. Is there serial correlation in this reduced
form?

b. Use Problem 11.8c to test for serial correlation in $\Delta u_{it}$. What do you conclude?


c. Add an FE to equation (11.40). (This procedure is appropriate if we add a random 
growth term to equation (11.39).) Estimate the equation in first differences by FEIV. 

d. Estimate equation (11.39) using the property crime rate, and test for serial
correlation in $\Delta u_{it}$. Are there important differences compared with the violent crime
rate?


11.14. An extension of the model in Example 11.7 that allows enterprise zone des- 
ignation to affect the growth of unemployment claims is 


$$\log(uclms_{it}) = \theta_t + c_i + g_it + \delta_1ez_{it} + \delta_2ez_{it} \cdot t + u_{it}.$$


Notice that each jurisdiction also has a separate growth rate $g_i$.


a. Use the data in EZUNEM.RAW to estimate this model by FD estimation,
followed by FE estimation on the differenced equation. Interpret your estimate of $\delta_2$. Is
it statistically significant?


b. Reestimate the model setting $\delta_1 = 0$. Does this model fit better than the basic
model in Example 11.4?


c. Let $w_i$ be an observed, time-constant variable, and suppose we add $\beta_1w_i + \beta_2w_i \cdot t$
to the random growth model. Can either $\beta_1$ or $\beta_2$ be estimated? Explain.


d. If we add $\gamma w_i \cdot ez_{it}$, can $\gamma$ be estimated?


11.15. Use the data in JTRAIN1.RAW for this question.


a. Consider the simple equation

$$\log(scrap_{it}) = \theta_t + \beta_1hrsemp_{it} + c_i + u_{it},$$


where $scrap_{it}$ is the scrap rate for firm $i$ in year $t$, and $hrsemp_{it}$ is hours of training per
employee. Suppose that you difference to remove $c_i$, but you still think that $\Delta hrsemp_{it}$
and $\Delta\log(scrap_{it})$ are simultaneously determined. Under what assumption is $\Delta grant_{it}$
a valid IV for $\Delta hrsemp_{it}$?


b. Using the differences from 1987 to 1988 only, test the rank condition for identifi- 
cation for the method described in part a. 


c. Estimate the FD equation by IV, and discuss the results. 


d. Compare the IV estimates on the first differences with the OLS estimates on the 
first differences. 


e. Use the IV method described in part a, but use all three years of data. How does
the estimate of $\beta_1$ compare with only using two years of data?


11.16. Consider a Hausman and Taylor-type model with a single time-constant
explanatory variable:

$$y_{it} = \gamma w_i + x_{it}\beta + c_i + u_{it},$$
$$E(u_{it} \mid w_i, x_i, c_i) = 0, \quad t = 1, \ldots, T,$$


where $x_{it}$ is a $1 \times K$ vector of time-varying explanatory variables.


a. If we are interested only in estimating $\beta$, how should we proceed, without making
additional assumptions (other than a standard rank assumption)?


b. Let $r_i$ be a time-constant proxy variable for $c_i$ in the sense that

$$E(c_i \mid r_i, w_i, x_i) = E(c_i \mid r_i, x_i) = \delta_0 + \delta_1r_i + \bar{x}_i\delta_2.$$


The key assumption is that, once we condition on $r_i$ and $x_i$, $w_i$ is not partially related
to $c_i$. Assuming the standard proxy variable redundancy assumption $E(u_{it} \mid w_i, x_i, c_i,
r_i) = 0$, find $E(y_{it} \mid w_i, x_i, r_i)$.

c. Using part b, argue that $\gamma$ is identified. Suggest a pooled OLS estimator.

d. Assume now that (1) Var$(u_{it} \mid w_i, x_i, c_i, r_i) = \sigma_u^2$, $t = 1, \ldots, T$; (2) Cov$(u_{it}, u_{is} \mid w_i,
x_i, c_i, r_i) = 0$, all $t \ne s$; (3) Var$(c_i \mid w_i, x_i, r_i) = \sigma_a^2$. How would you efficiently estimate
$\gamma$ (along with $\beta$, $\delta_0$, $\delta_1$, and $\delta_2$)? [Hint: It might be helpful to write $c_i = \delta_0 + \delta_1r_i +
\bar{x}_i\delta_2 + a_i$, where $E(a_i \mid w_i, x_i, r_i) = 0$ and Var$(a_i \mid w_i, x_i, r_i) = \sigma_a^2$.]

11.17. Derive equation (11.81). 


11.18. Let $\hat{\beta}_{REIV}$ be the REIV estimator.

a. Derive Avar$[\sqrt{N}(\hat{\beta}_{REIV} - \beta)]$ without Assumption REIV.3.


b. Show how to consistently estimate the asymptotic variance in part a. 


11.19. Use the data in AIRFARE.RAW for this exercise. 


a. Estimate the reduced forms underlying the REIV and FEIV analyses in Example
11.1. Using fully robust $t$ statistics, is $concen$ sufficiently (partially) correlated with
$lfare$?

b. Redo the REIV estimation, but drop the route distance variables. What happens 
to the estimated elasticity of passenger demand with respect to fare? 


c. Now consider a model where the elasticity can depend on route distance:

$$lpassen_{it} = \theta_{t1} + \alpha_1lfare_{it} + \delta_1ldist_i + \delta_2ldist_i^2 + \gamma_1(ldist_i - \mu_1)lfare_{it}$$
$$+ \gamma_2(ldist_i^2 - \mu_2)lfare_{it} + c_{i1} + u_{it1},$$


where $\mu_1 = E(ldist_i)$ and $\mu_2 = E(ldist_i^2)$. The means are subtracted before forming the
interactions so that $\alpha_1$ is the average partial effect. In using REIV or FEIV to
estimate this model, what should be the IVs for the interaction terms?

d. Use the data in AIRFARE.RAW to estimate the model in part c, replacing $\mu_1$
and $\mu_2$ with their sample averages. How do the REIV and FEIV estimates of $\alpha_1$
compare with the estimates in Table 11.1?


e. Obtain fully robust standard errors for the FEIV estimation, and obtain a fully
robust test of joint significance of the interaction terms. (Ignore the estimation of $\mu_1$
and $\mu_2$.) What is the robust 95 percent confidence interval for $\alpha_1$?


f. Find the estimated elasticities for $dist = 500$ and $dist = 1{,}500$. What do you
conclude?


III GENERAL APPROACHES TO NONLINEAR ESTIMATION


In this part we begin our study of nonlinear econometric methods. What we mean 
by nonlinear needs some explanation because it does not necessarily mean that the 
underlying model is what we would think of as nonlinear. For example, suppose the 
population model of interest can be written as $y = x\beta + u$, but, rather than assuming
$E(u \mid x) = 0$, we assume that the median of $u$ given $x$ is zero for all $x$. This assumption
implies Med$(y \mid x) = x\beta$, which is a linear model for the conditional median of $y$
given x. (The conditional mean, E(y |x), may or may not be linear in x.) The stan- 
dard estimator for a conditional median turns out to be least absolute deviations 
(LAD), not ordinary least squares. Like OLS, the LAD estimator solves a minimi- 
zation problem: it minimizes the sum of absolute residuals. However, there is a key 
difference between LAD and OLS: the LAD estimator cannot be obtained in closed 
form. The lack of a closed-form expression for LAD has implications not only for 
obtaining the LAD estimates from a sample of data, but also for the asymptotic 
theory of LAD. 

All the estimators we studied in Part II were obtained in closed form, a feature that 
greatly facilitates asymptotic analysis: we needed nothing more than the weak law of 
large numbers, the central limit theorem, and the basic algebra of probability limits. 
When an estimation method does not deliver closed-form solutions, we need to use 
more advanced asymptotic theory. In what follows, “nonlinear” describes any prob- 
lem in which the estimators cannot be obtained in closed form. 

The three chapters in this part provide the foundation for asymptotic analysis of 
most nonlinear models encountered in applications with cross section or panel data. 
We will make certain assumptions concerning continuity and differentiability, and so 
problems violating these conditions will not be covered. In the general development 
of M-estimators in Chapter 12, we will mention some of the applications that are 
ruled out and provide references. 

This part of the book is by far the most technical. We will not dwell on the some- 
times intricate arguments used to establish consistency and asymptotic normality in 
nonlinear contexts. For completeness, we do provide some general results on consis- 
tency and asymptotic normality for general classes of estimators. However, for specific 
estimation methods, such as nonlinear least squares, we will only state assumptions 
that have real impact for performing inference. Unless the underlying regularity 
conditions—which involve assuming that certain moments of the population random 
variables are finite, as well as assuming continuity and differentiability of the regres- 
sion function or log-likelihood function—are obviously false, they are usually just 
assumed. Where possible, the assumptions will correspond closely with those given 
previously for linear models. 


The analysis of maximum likelihood methods in Chapter 13 is greatly simplified 
once we have given a general treatment of M-estimators. Chapter 14 contains results 
for generalized method of moments estimators for models nonlinear in parameters. 
We also briefly discuss the related topic of minimum distance estimation in Chapter 
14. 

Readers who are not interested in general approaches to nonlinear estimation 
might use the more technical material in these chapters only when needed for refer- 
ence in Part IV. 


12 M-Estimation, Nonlinear Regression, and Quantile Regression


12.1 Introduction 


We begin our study of nonlinear estimation with a general class of estimators known 
as M-estimators, a term introduced by Huber (1967). (You might think of the “M” 
as standing for minimization or maximization.) M-estimation methods include max- 
imum likelihood, nonlinear least squares, least absolute deviations, quasi-maximum 
likelihood, and many other procedures used by econometricians. 

Much of this chapter is somewhat abstract and technical, but it is useful to develop 
a unified theory early on so that it can be applied in a variety of situations. We will 
carry along the example of nonlinear least squares for cross section data to motivate 
the general approach. In Sections 12.9 and 12.10, we study multivariate nonlinear 
regression and quantile regression, two practically important estimation methods. 

In a nonlinear regression model, we have a random variable, y, and we would like 
to model E(y| x) as a function of the explanatory variables x, a K-vector. We already 
know how to estimate models of E(y| x) when the model is linear in its parameters: 
OLS produces consistent, asymptotically normal estimators. What happens if the re- 
gression function is nonlinear in its parameters? 

Generally, let $m(x, \theta)$ be a parametric model for $E(y \mid x)$, where $m$ is a known
function of $x$ and $\theta$, and $\theta$ is a $P \times 1$ parameter vector. (This is a parametric model
because $m(\cdot, \theta)$ is assumed to be known up to a finite number of parameters.) The
dimension of the parameters, $P$, can be less than or greater than $K$. The parameter
space, $\Theta$, is a subset of $\mathbb{R}^P$. This is the set of values of $\theta$ that we are willing to
consider in the regression function. Unlike in linear models, for nonlinear models the
asymptotic analysis requires explicit assumptions on the parameter space.

An example of a nonlinear regression function is the exponential regression function,
$m(x, \theta) = \exp(x\theta)$, where $x$ is a row vector and contains unity as its first element.
This is a useful functional form whenever $y > 0$. A regression model suitable
when the response $y$ is restricted to the unit interval is the logistic function,
$m(x, \theta) = \exp(x\theta)/[1 + \exp(x\theta)]$. Both the exponential and logistic functions are
nonlinear in $\theta$.

In any application, there is no guarantee that our chosen model is adequate for 
E(y|x). We say that we have a correctly specified model for the conditional mean, 
$E(y \mid x)$, if, for some $\theta_o \in \Theta$,


$$E(y \mid x) = m(x, \theta_o). \tag{12.1}$$


We introduce the subscript "o" on theta to distinguish the parameter vector appearing
in $E(y \mid x)$ from other candidates for that vector. (Often, the value $\theta_o$ is called
"the true value of theta," a phrase that is somewhat loose but still useful as
shorthand.) As an example, for $y > 0$ and a single explanatory variable $x > 0$,
consider the model $m(x, \theta) = \theta_1 x^{\theta_2}$. If the population regression function is
$E(y \mid x) = 4x^{1.5}$, then $\theta_{o1} = 4$ and $\theta_{o2} = 1.5$. We will never know the actual $\theta_{o1}$ and
$\theta_{o2}$ (unless we somehow control the way the data have been generated), but, if the
model is correctly specified, then these values exist, and we would like to estimate
them. Generic candidates for $\theta_{o1}$ and $\theta_{o2}$ are labeled $\theta_1$ and $\theta_2$, and, without further
information, $\theta_1$ is any positive number and $\theta_2$ is any real number: the parameter
space is $\Theta \equiv \{(\theta_1, \theta_2) : \theta_1 > 0, \theta_2 \in \mathbb{R}\}$. For an exponential regression model,
$m(x, \theta) = \exp(x\theta)$ is a correctly specified model for $E(y \mid x)$ if and only if there is
some $K$-vector $\theta_o$ such that $E(y \mid x) = \exp(x\theta_o)$.

In our analysis of linear models, there was no need to make the distinction between 
the parameter vector in the population regression function and other candidates for 
this vector because the estimators in linear contexts are obtained in closed form, and 
so their asymptotic properties can be studied directly. As we will see, in our
theoretical development we need to distinguish the vector appearing in $E(y \mid x)$ from a
generic element of $\Theta$. We will often drop the subscripting by "o" when studying
particular applications because the notation can be cumbersome when there are 
many parameters. 

Equation (12.1) is the most general way of thinking about what nonlinear least 
squares is intended to do: estimate models of conditional expectations. But, as a sta- 
tistical matter, equation (12.1) is equivalent to a model with an additive, unobserv- 
able error with a zero conditional mean: 


$$y = m(x, \theta_o) + u, \qquad E(u \mid x) = 0. \tag{12.2}$$


Given equation (12.2), equation (12.1) clearly holds. Conversely, given equation
(12.1), we obtain equation (12.2) by defining the error to be $u = y - m(x, \theta_o)$. In
interpreting the model and deciding on appropriate estimation methods, we should
not focus on the error form in equation (12.2) because, evidently, the additivity of $u$
has some unintended connotations. In particular, we must remember that, in writing
the model in additive error form, the only thing implied by equation (12.1) is
$E(u \mid x) = 0$. Depending on the nature of $y$, the error $u$ may have some unusual
properties. For example, if $y > 0$ then $u > -m(x, \theta_o)$, in which case $u$ and $x$ cannot
be independent. Heteroskedasticity in the error, that is, $\mathrm{Var}(u \mid x) \neq \mathrm{Var}(u)$, is
present whenever $\mathrm{Var}(y \mid x)$ depends on $x$, as is very common when $y$ takes on a
restricted range of values. Plus, when we introduce randomly sampled observations
$\{(x_i, y_i) : i = 1, 2, \ldots, N\}$, it is too tempting to write the model and its assumptions as
"$y_i = m(x_i, \theta_o) + u_i$ where the $u_i$ are i.i.d. errors." As we discussed in Section 1.4 for
the linear model, under random sampling the $\{u_i\}$ are always i.i.d. What is usually
meant is that $u_i$ and $x_i$ are independent, but, for the reasons we just gave, this
assumption is often much too strong. The error form of the model does turn out to be
useful for defining estimators of asymptotic variances and for obtaining test statistics.

For later reference, we formalize the first nonlinear least squares (NLS) assumption 
as follows: 


ASSUMPTION NLS.1: For some $\theta_o \in \Theta$, $E(y \mid x) = m(x, \theta_o)$.


This form of presentation represents the level at which we will state assumptions for 
particular econometric methods. In our general development of M-estimators that 
follows, we will need to add conditions involving moments of $m(x, \theta)$ and $y$, as well
as continuity assumptions on $m(x, \cdot)$.

If we let $w \equiv (x, y)$, then $\theta_o$ indexes a feature of the population distribution of $w$,
namely, the conditional mean of $y$ given $x$. More generally, let $w$ be an $M$-vector of
random variables with some distribution in the population. We let $\mathcal{W}$ denote the
subset of $\mathbb{R}^M$ representing the possible values of $w$. Let $\theta_o$ denote a parameter vector
describing some feature of the distribution of $w$. This could be a conditional mean, a
conditional mean and conditional variance, a conditional median, or a conditional
distribution. As shorthand, we call $\theta_o$ "the true parameter" or "the true value of
theta." These phrases simply mean that $\theta_o$ is the parameter vector describing the
underlying population, something we will make precise later. We assume that $\theta_o$
belongs to a known parameter space $\Theta \subset \mathbb{R}^P$.

We assume that our data come as a random sample of size $N$ from the population;
we label this random sample $\{w_i : i = 1, 2, \ldots\}$, where each $w_i$ is an $M$-vector. This
assumption is much more general than it may initially seem. It covers cross section
models with many equations, and it also covers panel data settings with small time
series dimension. The extension to independently pooled cross sections is almost
immediate. In the NLS example, $w_i$ consists of $x_i$ and $y_i$, the $i$th draw from the
population on $x$ and $y$.

What allows us to estimate $\theta_o$ when it indexes $E(y \mid x)$? It is the fact that $\theta_o$ is the
value of $\theta$ that minimizes the expected squared error between $y$ and $m(x, \theta)$. That is,
$\theta_o$ solves the population problem

$$\min_{\theta \in \Theta} E\{[y - m(x, \theta)]^2\}, \tag{12.3}$$

where the expectation is over the joint distribution of $(x, y)$. This conclusion follows
immediately from basic properties of conditional expectations (in particular, condition
CE.8 in Chapter 2). We will give a slightly different argument here. Write

$$[y - m(x, \theta)]^2 = [y - m(x, \theta_o)]^2 + 2[m(x, \theta_o) - m(x, \theta)]u + [m(x, \theta_o) - m(x, \theta)]^2, \tag{12.4}$$


where $u$ is defined in equation (12.2). Now, since $E(u \mid x) = 0$, $u$ is uncorrelated with
any function of $x$, including $m(x, \theta_o) - m(x, \theta)$. Thus, taking the expected value of
equation (12.4) gives

$$E\{[y - m(x, \theta)]^2\} = E\{[y - m(x, \theta_o)]^2\} + E\{[m(x, \theta_o) - m(x, \theta)]^2\}. \tag{12.5}$$

Since the last term in equation (12.5) is nonnegative, it follows that

$$E\{[y - m(x, \theta)]^2\} \ge E\{[y - m(x, \theta_o)]^2\}, \qquad \text{all } \theta \in \Theta. \tag{12.6}$$

The inequality is strict when $\theta \neq \theta_o$ unless $E\{[m(x, \theta_o) - m(x, \theta)]^2\} = 0$; for $\theta_o$ to be
identified, we will have to rule this possibility out.
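The decomposition in equation (12.5) can be checked numerically. The following is a minimal sketch, not part of the original text: it assumes the power model $m(x, \theta) = \theta_1 x^{\theta_2}$ with $\theta_o = (4, 1.5)$ from the earlier example, and the sampling design (uniform $x$, multiplicative mean-one noise, the seed, and the candidate value of $\theta$) is purely illustrative.

```python
import random

def m(x, theta):
    """Power regression function m(x, theta) = theta1 * x**theta2."""
    return theta[0] * x ** theta[1]

THETA_O = (4.0, 1.5)   # "true" value, known here only because we simulate

rng = random.Random(12345)
n = 20000
xs = [rng.uniform(1.0, 3.0) for _ in range(n)]
# Multiplicative noise with mean one keeps y > 0 and E(y | x) = m(x, THETA_O).
ys = [m(x, THETA_O) * rng.uniform(0.5, 1.5) for x in xs]

def mse(theta):
    """Sample analogue of E{[y - m(x, theta)]^2}."""
    return sum((y - m(x, theta)) ** 2 for x, y in zip(xs, ys)) / n

def gap(theta):
    """Sample analogue of E{[m(x, theta_o) - m(x, theta)]^2}."""
    return sum((m(x, THETA_O) - m(x, theta)) ** 2 for x in xs) / n

theta = (3.0, 1.7)     # an arbitrary candidate different from THETA_O
lhs = mse(theta)
rhs = mse(THETA_O) + gap(theta)
print(f"E[(y - m(x, theta))^2]   ~ {lhs:.3f}")
print(f"right side of (12.5)     ~ {rhs:.3f}")
print(f"E[(y - m(x, theta_o))^2] ~ {mse(THETA_O):.3f}")
```

The two sides differ only by twice the sample average of $[m(x, \theta_o) - m(x, \theta)]u$, which has mean zero because $E(u \mid x) = 0$, so the difference should shrink as the simulated sample grows.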

Because $\theta_o$ solves the population problem in expression (12.3), the analogy
principle, which we introduced in Chapter 4, suggests estimating $\theta_o$ by solving the
sample analogue. In other words, we replace the population moment $E\{[y - m(x, \theta)]^2\}$
with the sample average. The NLS estimator of $\theta_o$, $\hat{\theta}$, solves

$$\min_{\theta \in \Theta} N^{-1} \sum_{i=1}^{N} [y_i - m(x_i, \theta)]^2. \tag{12.7}$$

For now, we assume that a solution to this problem exists.
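To make problem (12.7) concrete, here is a hedged sketch of one way to compute an NLS estimate for the power model $m(x, \theta) = \theta_1 x^{\theta_2}$. The Gauss-Newton iteration with step halving, the log-log starting values, and the simulated data are all choices made for this illustration; the text does not prescribe a numerical method, and the function names are hypothetical.

```python
import math
import random

def m(x, theta):
    return theta[0] * x ** theta[1]

def objective(theta, xs, ys):
    """NLS objective in (12.7): N^{-1} times the sum of squared residuals."""
    return sum((y - m(x, theta)) ** 2 for x, y in zip(xs, ys)) / len(xs)

def loglog_start(xs, ys):
    """Starting values from OLS of log(y) on log(x); requires x > 0 and y > 0."""
    lx = [math.log(x) for x in xs]
    ly = [math.log(y) for y in ys]
    mx, my = sum(lx) / len(lx), sum(ly) / len(ly)
    slope = (sum((a - mx) * (b - my) for a, b in zip(lx, ly))
             / sum((a - mx) ** 2 for a in lx))
    return [math.exp(my - slope * mx), slope]

def nls(xs, ys, start, max_iter=100, tol=1e-10):
    """Gauss-Newton with step halving for the two-parameter power model."""
    theta = list(start)
    for _ in range(max_iter):
        a11 = a12 = a22 = b1 = b2 = 0.0
        for x, y in zip(xs, ys):
            g1 = x ** theta[1]                 # dm/dtheta1
            g2 = theta[0] * g1 * math.log(x)   # dm/dtheta2
            r = y - theta[0] * g1              # residual
            a11 += g1 * g1
            a12 += g1 * g2
            a22 += g2 * g2
            b1 += g1 * r
            b2 += g2 * r
        det = a11 * a22 - a12 * a12
        d1 = (a22 * b1 - a12 * b2) / det       # solve the 2 x 2 system
        d2 = (a11 * b2 - a12 * b1) / det
        step, current = 1.0, objective(theta, xs, ys)
        while step > 1e-8:                     # halve until no increase
            trial = [theta[0] + step * d1, theta[1] + step * d2]
            if objective(trial, xs, ys) <= current:
                break
            step /= 2.0
        theta = trial
        if abs(step * d1) + abs(step * d2) < tol:
            break
    return theta

rng = random.Random(2024)
xs = [rng.uniform(1.0, 3.0) for _ in range(2000)]
ys = [4.0 * x ** 1.5 * rng.uniform(0.5, 1.5) for x in xs]  # E(y|x) = 4 x^1.5

start = loglog_start(xs, ys)
theta_hat = nls(xs, ys, start)
print("start:", start)
print("NLS estimate:", theta_hat)
```

Because the simulated errors are heteroskedastic and bounded below by $-m(x, \theta_o)$, the example also reflects the earlier point that $u$ and $x$ need not be independent for NLS to target $\theta_o$.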

The NLS objective function in expression (12.7) is a special case of a more general
class of estimators. Let $q(w, \theta)$ be a function of the random vector $w$ and the parameter
vector $\theta$. An M-estimator of $\theta_o$ solves the problem

$$\min_{\theta \in \Theta} N^{-1} \sum_{i=1}^{N} q(w_i, \theta), \tag{12.8}$$

assuming that a solution, call it $\hat{\theta}$, exists. The estimator clearly depends on the sample
$\{w_i : i = 1, 2, \ldots, N\}$, but we suppress that fact in the notation.

The objective function for an M-estimator is a sample average of a function of 
$w_i$ and $\theta$. The division by $N$, while needed for the theoretical development, does not
affect the minimization problem. Also, the focus on minimization rather than maxi- 
mization is without loss of generality because maximization can be trivially turned 
into minimization. 
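As a small illustration of problem (12.8), the sketch below treats the simplest case of a scalar parameter and no regressors, so that $w = y$ and the "regression function" is just $\theta$. Under that assumption, squared loss should return the sample mean and absolute loss the sample median. The golden-section search, the function name `m_estimate`, and the simulated data are illustrative assumptions, not something taken from the text.

```python
import math
import random
import statistics

def m_estimate(q, data, lo, hi, iters=100):
    """Minimize N^{-1} sum_i q(w_i, theta) over [lo, hi] by golden-section search.

    Assumes the sample objective is unimodal on [lo, hi].
    """
    def objective(theta):
        return sum(q(w, theta) for w in data) / len(data)

    ratio = (math.sqrt(5.0) - 1.0) / 2.0   # inverse golden ratio, about 0.618
    a, b = lo, hi
    c = b - ratio * (b - a)
    d = a + ratio * (b - a)
    fc, fd = objective(c), objective(d)
    for _ in range(iters):
        if fc < fd:                 # a minimizer lies in [a, d]
            b, d, fd = d, c, fc
            c = b - ratio * (b - a)
            fc = objective(c)
        else:                       # a minimizer lies in [c, b]
            a, c, fc = c, d, fd
            d = a + ratio * (b - a)
            fd = objective(d)
    return (a + b) / 2.0

rng = random.Random(7)
ws = [rng.expovariate(1.0) for _ in range(501)]   # skewed, so mean != median

theta_sq = m_estimate(lambda w, t: (w - t) ** 2, ws, min(ws), max(ws))
theta_abs = m_estimate(lambda w, t: abs(w - t), ws, min(ws), max(ws))

print("squared loss :", theta_sq, "sample mean  :", statistics.mean(ws))
print("absolute loss:", theta_abs, "sample median:", statistics.median(ws))
```

The same routine handles both objective functions; only the choice of $q$ changes, which is the sense in which different estimators share one M-estimation framework.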

The parameter vector $\theta_o$ is assumed to uniquely solve the population problem

$$\min_{\theta \in \Theta} E[q(w, \theta)]. \tag{12.9}$$

Comparing equations (12.8) and (12.9), we see that M-estimators are based on the
analogy principle. Once $\theta_o$ has been defined, finding an appropriate function $q$ that
delivers $\theta_o$ as the solution to problem (12.9) requires basic results from probability
theory. Often there is more than one choice of $q$ such that $\theta_o$ solves problem (12.9), in
which case the choice depends on efficiency or computational issues. For the next
several sections, we carry along the NLS example; we treat maximum likelihood
estimation in Chapter 13.

How do we translate the fact that $\theta_o$ solves the population problem (12.9) into
consistency of the M-estimator $\hat{\theta}$ that solves problem (12.8)? Heuristically, the
argument is as follows. Since for each $\theta \in \Theta$, $\{q(w_i, \theta) : i = 1, 2, \ldots\}$ is just an i.i.d.
sequence, the law of large numbers implies that

$$N^{-1} \sum_{i=1}^{N} q(w_i, \theta) \overset{p}{\to} E[q(w, \theta)] \tag{12.10}$$

under very weak finite moment assumptions. Since $\hat{\theta}$ minimizes the function on the
left side of equation (12.10) and $\theta_o$ minimizes the function on the right, it seems
plausible that $\hat{\theta} \overset{p}{\to} \theta_o$. This informal argument turns out to be correct, except in
pathological cases. There are essentially two issues to address. The first is
identifiability of $\theta_o$, which is purely a population issue. The second is the sense in which the
convergence in equation (12.10) happens across different values of $\theta$ in $\Theta$.


12.2 Identification, Uniform Convergence, and Consistency 


We now present a formal consistency result for M-estimators under fairly weak 
assumptions. As mentioned previously, the conditions can be broken down into two 
parts. The first part is the identification or identifiability of $\theta_o$. For nonlinear
regression, we showed how $\theta_o$ solves the population problem (12.3). However, we did not
argue that $\theta_o$ is always the unique solution to problem (12.3). Whether or not this is
the case depends on the distribution of $x$ and the nature of the regression function:


ASSUMPTION NLS.2: $E\{[m(x, \theta_o) - m(x, \theta)]^2\} > 0$, all $\theta \in \Theta$, $\theta \neq \theta_o$.


Assumption NLS.2 plays the same role as Assumption OLS.2 in Chapter 4. It can
fail if the explanatory variables $x$ do not have sufficient variation in the population.
In fact, in the linear case $m(x, \theta) = x\theta$, Assumption NLS.2 holds if and only if rank
$E(x'x) = K$, which is just Assumption OLS.2 from Chapter 4. In nonlinear models,
Assumption NLS.2 can fail if $m(x, \theta_o)$ depends on fewer parameters than are actually
in $\theta$. For example, suppose that we choose as our model
$m(x, \theta) = \theta_1 + \theta_2 x_2 + \theta_3 x_3^{\theta_4}$, but the true model is linear in $x_2$: $\theta_{o3} = 0$. Then
$E\{[y - m(x, \theta)]^2\}$ is minimized for any $\theta$ with $\theta_1 = \theta_{o1}$, $\theta_2 = \theta_{o2}$, $\theta_3 = 0$, and
$\theta_4$ any value. If $\theta_{o3} \neq 0$, Assumption NLS.2 would typically hold provided there is
sufficient variation in $x_2$ and $x_3$. Because identification fails for certain values of
$\theta_o$, this is an example of a poorly identified model. (See Section 9.5 for other examples
of poorly identified models.)

Identification in commonly used nonlinear regression models, such as exponential
and logistic regression functions, holds under weak conditions, provided perfect
collinearity in $x$ can be ruled out. For the most part, we will just assume that, when the
model is correctly specified, $\theta_o$ is the unique solution to problem (12.3). For the
general M-estimation case, we assume that $q(w, \theta)$ has been chosen so that $\theta_o$ is a
solution to problem (12.9). Identification requires that $\theta_o$ be the unique solution:


$$E[q(w, \theta_o)] < E[q(w, \theta)], \qquad \text{all } \theta \in \Theta,\ \theta \neq \theta_o. \tag{12.11}$$
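The failure of identification in the example $m(x, \theta) = \theta_1 + \theta_2 x_2 + \theta_3 x_3^{\theta_4}$ can be seen directly in a sample objective function. The sketch below is illustrative only: the data-generating values ($\theta_{o1} = 1$, $\theta_{o2} = 2$, $\theta_{o3} = 0$), the distributions of $x_2$ and $x_3$, and the grid of $\theta_4$ values are assumptions made for the demonstration.

```python
import random

def m(x2, x3, theta):
    """m(x, theta) = theta1 + theta2*x2 + theta3*x3**theta4."""
    t1, t2, t3, t4 = theta
    return t1 + t2 * x2 + t3 * x3 ** t4

rng = random.Random(3)
n = 1000
data = []
for _ in range(n):
    x2, x3 = rng.uniform(0.0, 1.0), rng.uniform(1.0, 2.0)
    y = 1.0 + 2.0 * x2 + rng.gauss(0.0, 1.0)   # theta_o3 = 0: x3 is irrelevant
    data.append((x2, x3, y))

def objective(theta):
    """Sample average of squared residuals, as in (12.7)."""
    return sum((y - m(x2, x3, theta)) ** 2 for x2, x3, y in data) / n

theta4_grid = [-2.0, 0.5, 1.0, 3.0]
# With theta3 = 0 the value of theta4 cannot matter: the objective is flat.
flat = [objective((1.0, 2.0, 0.0, t4)) for t4 in theta4_grid]
# Once theta3 is moved away from zero, the objective does vary with theta4.
curved = [objective((1.0, 2.0, 0.5, t4)) for t4 in theta4_grid]

print("theta3 = 0.0:", flat)
print("theta3 = 0.5:", curved)
```

The flat row shows why no estimator based on this objective can pin down $\theta_4$ when $\theta_{o3} = 0$: every value of $\theta_4$ fits equally well.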


The second component for consistency of the M-estimator is convergence of
the sample average $N^{-1} \sum_{i=1}^{N} q(w_i, \theta)$ to its expected value. It turns out that
pointwise convergence in probability, as stated in equation (12.10), is not sufficient for
consistency. That is, it is not enough to simply invoke the usual weak law of large
numbers at each $\theta \in \Theta$. Instead, uniform convergence in probability is sufficient.
Mathematically,

$$\max_{\theta \in \Theta} \left| N^{-1} \sum_{i=1}^{N} q(w_i, \theta) - E[q(w, \theta)] \right| \overset{p}{\to} 0. \tag{12.12}$$

Uniform convergence clearly implies pointwise convergence, but the converse is not
true: it is possible for equation (12.10) to hold but equation (12.12) to fail. Nevertheless,
under certain regularity conditions, the pointwise convergence in equation
(12.10) translates into the uniform convergence in equation (12.12).
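The quantity inside (12.12) can be computed in a case where the population expectation is known in closed form. The sketch below assumes $q(w, \theta) = (w - \theta)^2$ with $w \sim \mathrm{Normal}(0, 1)$, so that $E[q(w, \theta)] = 1 + \theta^2$, and takes $\Theta = [-2, 2]$. These choices, the grid, and the sample sizes are illustrative assumptions; the maximum is taken over a finite grid, which is exact here only because the gap is linear in $\theta$ and the grid contains the endpoints.

```python
import random

def sup_gap(n, rng, grid):
    """Max over grid of |N^{-1} sum_i q(w_i, theta) - E[q(w, theta)]|.

    Uses q(w, theta) = (w - theta)^2 and w ~ Normal(0, 1), so that
    E[q(w, theta)] = 1 + theta^2.
    """
    ws = [rng.gauss(0.0, 1.0) for _ in range(n)]
    m1 = sum(ws) / n                   # sample mean of w
    m2 = sum(w * w for w in ws) / n    # sample mean of w^2
    # N^{-1} sum_i (w_i - theta)^2 = m2 - 2*theta*m1 + theta^2
    return max(abs((m2 - 2.0 * t * m1 + t * t) - (1.0 + t * t)) for t in grid)

rng = random.Random(11)
grid = [-2.0 + 4.0 * k / 200 for k in range(201)]   # grid on the compact set [-2, 2]
gaps = {n: sup_gap(n, rng, grid) for n in (100, 1000, 20000)}

for n, g in gaps.items():
    print(f"N = {n:6d}: largest gap over the grid = {g:.4f}")
```

In this example the largest gap is bounded by $|N^{-1}\sum w_i^2 - 1| + 4|N^{-1}\sum w_i|$, so it shrinks with $N$ over the whole bounded set at once. If $\Theta$ were the entire real line, the same bound would not be available, which hints at why bounded parameter sets are convenient.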

To state a formal result concerning uniform convergence, we need to be more
careful in stating assumptions about the function $q(\cdot, \cdot)$ and the parameter space $\Theta$.
Since we are taking expected values of $q(w, \theta)$ with respect to the distribution of $w$,
$q(w, \theta)$ must be a random variable for each $\theta \in \Theta$. Technically, we should assume
that $q(\cdot, \theta)$ is a Borel measurable function on $\mathcal{W}$ for each $\theta \in \Theta$. Since it is very
difficult to write down a function that is not Borel measurable, we spend no further time
on it. Rest assured that any objective function that arises in econometrics is Borel
measurable. You are referred to Billingsley (1979) and Davidson (1994, Chap. 3).

The next assumption concerning $q$ is practically more important. We assume that,
for each $w \in \mathcal{W}$, $q(w, \cdot)$ is a continuous function over the parameter space $\Theta$. All of
the problems we treat in detail have objective functions that are continuous in the
parameters, but these do not cover all cases of interest. For example, Manski's (1975)
maximum score estimator for binary response models has an objective function that
is not continuous in $\theta$. (We cover binary response models in Chapter 15.) It is
possible to somewhat relax the continuity assumption in order to handle such cases, but
we will not need that generality. See Manski (1988, Sect. 7.3) and Newey and
McFadden (1994).

Obtaining uniform convergence is generally difficult for unbounded parameter sets,
such as $\Theta = \mathbb{R}^P$. It is easiest to assume that $\Theta$ is a compact subset of $\mathbb{R}^P$, which
means that $\Theta$ is closed and bounded (see Rudin, 1976, Theorem 2.41). Because the
natural parameter spaces in most applications are not bounded (and sometimes not
closed), the compactness assumption is untidy for developing a general theory of
estimation. However, for most applications it is not an assumption to worry about: $\Theta$
can be defined to be such a large closed and bounded set as to always contain $\theta_o$.
Some consistency results for nonlinear estimation without compact parameter spaces
are available; see the discussion and references in Newey and McFadden (1994).

We can now state a theorem concerning uniform convergence appropriate for the 
random sampling environment. This result, known as the uniform weak law of large 
numbers (UWLLN), dates back to LeCam (1953). See also Newey and McFadden 
(1994, Lemma 2.4). 


THEOREM 12.1 (Uniform Weak Law of Large Numbers): Let $w$ be a random vector
taking values in $\mathcal{W} \subset \mathbb{R}^M$, let $\Theta$ be a subset of $\mathbb{R}^P$, and let
$q : \mathcal{W} \times \Theta \to \mathbb{R}$ be a real-valued function. Assume that (a) $\Theta$ is compact; (b) for each
$\theta \in \Theta$, $q(\cdot, \theta)$ is Borel measurable on $\mathcal{W}$; (c) for each $w \in \mathcal{W}$, $q(w, \cdot)$ is continuous
on $\Theta$; and (d) $|q(w, \theta)| \le b(w)$ for all $\theta \in \Theta$, where $b$ is a nonnegative function on
$\mathcal{W}$ such that $E[b(w)] < \infty$. Then equation (12.12) holds.


The only assumption we have not discussed is assumption d, which requires the
expected absolute value of $q(w, \theta)$ to be bounded across $\theta$. This kind of moment
condition is rarely verified in practice, although it can be. For example, for NLS with
$0 \le y \le 1$ and the mean function such that $0 \le m(x, \theta) \le 1$ for all $x$ and $\theta$, we can
take $b(w) \equiv 1$. See Newey and McFadden (1994) for more complicated examples.

The continuity and compactness assumptions are important for establishing
uniform convergence, and they also ensure that both the sample minimization problem
(12.8) and the population minimization problem (12.9) actually have solutions.
Consider problem (12.8) first. Under the assumptions of Theorem 12.1, the sample average
is a continuous function of $\theta$, since $q(w_i, \theta)$ is continuous for each $w_i$. Since a
continuous function on a compact space always achieves its minimum, the M-estimation
problem is well defined (there could be more than one solution). As a technical
matter, it can be shown that $\hat{\theta}$ is actually a random variable under the measurability
assumption on $q(\cdot, \theta)$. See, for example, Gallant and White (1988).

It can also be shown that, under the assumptions of Theorem 12.1, the function
$E[q(w, \theta)]$ is continuous as a function of $\theta$. Therefore, problem (12.9) also has at least
one solution; identifiability ensures that it has only one solution, which in turn implies
consistency of the M-estimator.


THEOREM 12.2 (Consistency of M-Estimators): Under the assumptions of Theorem
12.1, assume that the identification assumption (12.11) holds. Then a random vector,
$\hat{\theta}$, solves problem (12.8), and $\hat{\theta} \overset{p}{\to} \theta_o$.


A proof of Theorem 12.2 is given in Newey and McFadden (1994). For nonlinear
least squares, once Assumptions NLS.1 and NLS.2 are maintained, the practical
requirement is that $m(x, \cdot)$ be a continuous function over $\Theta$. Since this assumption is
almost always true in applications of NLS, we do not list it as a separate assumption.
Noncompactness of $\Theta$ is not much of a concern for most applications.

Theorem 12.2 also applies to median regression. Suppose that the conditional
median of $y$ given $x$ is $\mathrm{Med}(y \mid x) = m(x, \theta_o)$, where $m(x, \theta)$ is a known function of $x$
and $\theta$. The leading case is a linear model, $m(x, \theta) = x\theta$, where $x$ contains unity. The
least absolute deviations (LAD) estimator of $\theta_o$ solves

$$\min_{\theta \in \Theta} N^{-1} \sum_{i=1}^{N} |y_i - m(x_i, \theta)|.$$

If $\Theta$ is compact and $m(x, \cdot)$ is continuous over $\Theta$ for each $x$, a solution always exists.
The LAD estimator is motivated by the fact that $\theta_o$ minimizes $E[|y - m(x, \theta)|]$ over
the parameter space $\Theta$; this follows by the fact that for each $x$, the conditional median
is the minimum absolute loss predictor conditional on $x$. (See, for example, Bassett
and Koenker, 1978, and Manski, 1988, Sect. 4.2.2.) If we assume that $\theta_o$ is the
unique solution, a standard identification assumption, then the LAD estimator is
consistent very generally. In addition to the continuity, compactness, and
identification assumptions, it suffices that $E[|y|] < \infty$ and $|m(x, \theta)| \le a(x)$ for some function
$a(\cdot)$ such that $E[a(x)] < \infty$. (To see this point, take $b(w) \equiv |y| + a(x)$ in Theorem
12.2.)

Median regression is a special case of quantile regression, where we model quantiles 
in the distribution of y given x. For example, in addition to the median, we can es- 
timate how the first and third quartiles in the distribution of y given x change with x. 
Except for the median (which leads to LAD), the objective function that identifies a
conditional quantile is asymmetric about zero. We study quantile regression in
Section 12.10.

We end this section with a lemma that we use repeatedly in the rest of this chapter. 
It follows from Lemma 4.3 in Newey and McFadden (1994). 


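LEMMA 12.1: Suppose that $\hat{\theta} \overset{p}{\to} \theta_o$, and assume that $r(w, \theta)$ satisfies the same
assumptions on $q(w, \theta)$ in Theorem 12.2. Then

$$N^{-1} \sum_{i=1}^{N} r(w_i, \hat{\theta}) \overset{p}{\to} E[r(w, \theta_o)]. \tag{12.13}$$

That is, $N^{-1} \sum_{i=1}^{N} r(w_i, \hat{\theta})$ is a consistent estimator of $E[r(w, \theta_o)]$.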


Intuitively, Lemma 12.1 is quite reasonable. We know that $N^{-1} \sum_{i=1}^{N} r(w_i, \theta_o)$
generally converges in probability to $E[r(w, \theta_o)]$ by the law of large numbers. Lemma
12.1 shows that, if we replace $\theta_o$ in the sample average with a consistent estimator,
the convergence still holds, at least under standard regularity conditions. In fact, as
shown by Newey and McFadden (1994), we need only assume $r(w, \theta)$ is continuous
at $\theta_o$ with probability one.
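A familiar special case may help fix ideas about Lemma 12.1. The sketch below assumes a scalar setting that is not in the text: $w = y$, $\theta_o = E(y)$, $\hat{\theta}$ is the sample mean, and $r(w, \theta) = (y - \theta)^2$, so that $E[r(w, \theta_o)] = \mathrm{Var}(y)$. The exponential distribution, seed, and sample size are illustrative.

```python
import random

rng = random.Random(5)
n = 20000
ys = [rng.expovariate(1.0) for _ in range(n)]   # population mean 1, variance 1

theta_hat = sum(ys) / n          # consistent estimator of theta_o = E(y) = 1

def r(y, theta):
    """r(w, theta) = (y - theta)^2, continuous in theta."""
    return (y - theta) ** 2

# Feasible average: theta_o replaced by the estimator, as in (12.13).
plug_in = sum(r(y, theta_hat) for y in ys) / n
# Infeasible average: uses theta_o = 1, known here only because we simulate.
infeasible = sum(r(y, 1.0) for y in ys) / n

print("plug-in average   :", plug_in)
print("infeasible average:", infeasible)
print("E[r(w, theta_o)]  : 1.0")
```

In this case the two averages differ by exactly $(\hat{\theta} - \theta_o)^2$, which is why replacing $\theta_o$ with a consistent estimator does not change the probability limit.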


12.3 Asymptotic Normality 


Under additional assumptions on the objective function, we can also show that
M-estimators are asymptotically normally distributed (and converge at the rate $\sqrt{N}$).
It turns out that continuity over the parameter space does not ensure asymptotic
normality.

The simplest asymptotic normality proof proceeds as follows. Assume that $\theta_o$ is in
the interior of $\Theta$, which means that $\Theta$ must have nonempty interior; this assumption
is true in most applications. Then, since $\hat{\theta} \overset{p}{\to} \theta_o$, $\hat{\theta}$ is in the interior of $\Theta$ with
probability approaching one. If $q(w, \cdot)$ is continuously differentiable on the interior of $\Theta$,
then (with probability approaching one) $\hat{\theta}$ solves the first-order condition

$$\sum_{i=1}^{N} s(w_i, \hat{\theta}) = 0, \tag{12.14}$$

where $s(w, \theta)$ is the $P \times 1$ vector of partial derivatives of $q(w, \theta)$:
$s(w, \theta)' = \nabla_\theta q(w, \theta) \equiv [\partial q(w, \theta)/\partial \theta_1, \partial q(w, \theta)/\partial \theta_2, \ldots, \partial q(w, \theta)/\partial \theta_P]$.
(That is, $s(w, \theta)$ is the transpose of the gradient of $q(w, \theta)$.) We call $s(w, \theta)$ the score
of the objective function $q(w, \theta)$. While condition (12.14) can only be guaranteed to hold
with probability approaching one, usually it holds exactly; at any rate, we will drop the
qualifier, as it does not affect the derivation of the limiting distribution.

If $q(w, \cdot)$ is twice continuously differentiable, then each row of the left-hand side of
equation (12.14) can be expanded about $\theta_o$ in a mean-value expansion:

$$\sum_{i=1}^{N} s(w_i, \hat{\theta}) = \sum_{i=1}^{N} s(w_i, \theta_o) + \left( \sum_{i=1}^{N} \ddot{H}_i \right) (\hat{\theta} - \theta_o). \tag{12.15}$$

The notation $\ddot{H}_i$ denotes the $P \times P$ Hessian of the objective function, $q(w_i, \theta)$, with
respect to $\theta$, but with each row of $H(w_i, \theta) \equiv \partial^2 q(w_i, \theta)/\partial\theta\,\partial\theta' \equiv \nabla_\theta^2 q(w_i, \theta)$ evaluated
at a different mean value. Each of the $P$ mean values is on the line segment between
$\theta_o$ and $\hat{\theta}$. We cannot know what these mean values are, but we do know that each
must converge in probability to $\theta_o$ (since each is "trapped" between $\hat{\theta}$ and $\theta_o$).
Combining equations (12.14) and (12.15) and multiplying through by $1/\sqrt{N}$ gives

$$0 = N^{-1/2} \sum_{i=1}^{N} s(w_i, \theta_o) + \left( N^{-1} \sum_{i=1}^{N} \ddot{H}_i \right) \sqrt{N} (\hat{\theta} - \theta_o).$$


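Now, we can apply Lemma 12.1 to get $N^{-1} \sum_{i=1}^{N} \ddot{H}_i \overset{p}{\to} E[H(w, \theta_o)]$ (under some
moment conditions). If $A_o \equiv E[H(w, \theta_o)]$ is nonsingular, then $N^{-1} \sum_{i=1}^{N} \ddot{H}_i$ is
nonsingular w.p.a.1 and $(N^{-1} \sum_{i=1}^{N} \ddot{H}_i)^{-1} \overset{p}{\to} A_o^{-1}$. Therefore, we can write

$$\sqrt{N} (\hat{\theta} - \theta_o) = \left( N^{-1} \sum_{i=1}^{N} \ddot{H}_i \right)^{-1} \left[ -N^{-1/2} \sum_{i=1}^{N} s_i(\theta_o) \right],$$

where $s_i(\theta_o) \equiv s(w_i, \theta_o)$. As we will show, $E[s_i(\theta_o)] = 0$. Therefore, $N^{-1/2} \sum_{i=1}^{N} s_i(\theta_o)$
generally satisfies the central limit theorem because it is the average of i.i.d. random
vectors with zero mean, multiplied by the usual $\sqrt{N}$. Since $o_p(1) \cdot O_p(1) = o_p(1)$, we
have

$$\sqrt{N} (\hat{\theta} - \theta_o) = A_o^{-1} \left[ -N^{-1/2} \sum_{i=1}^{N} s_i(\theta_o) \right] + o_p(1). \tag{12.16}$$

This is an important equation. It shows that $\sqrt{N} (\hat{\theta} - \theta_o)$ inherits its limiting
distribution from the average of the scores, evaluated at $\theta_o$. The matrix $A_o^{-1}$ simply acts as
a linear transformation. If we absorb this linear transformation into $s_i(\theta_o)$, we can
write

$$\sqrt{N} (\hat{\theta} - \theta_o) = N^{-1/2} \sum_{i=1}^{N} e_i(\theta_o) + o_p(1), \tag{12.17}$$

where $e_i(\theta_o) \equiv -A_o^{-1} s_i(\theta_o)$; this is sometimes called the influence function
representation of $\hat{\theta}$, where $e(w, \theta)$ is the influence function.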

Equation (12.16) (or (12.17)) allows us to derive the first-order asymptotic
distribution of $\hat{\theta}$. Higher order representations attempt to reduce the error in the $o_p(1)$
term in equation (12.16); such derivations are much more complicated than equation
(12.16) and are beyond the scope of this book.

We have essentially proved the following result: 


THEOREM 12.3 (Asymptotic Normality of M-Estimators): In addition to the
assumptions in Theorem 12.2, assume (a) $\theta_o$ is in the interior of $\Theta$; (b) $s(w, \cdot)$ is
continuously differentiable on the interior of $\Theta$ for all $w \in \mathcal{W}$; (c) each element of $H(w, \theta)$
is bounded in absolute value by a function $b(w)$, where $E[b(w)] < \infty$; (d)
$A_o \equiv E[H(w, \theta_o)]$ is positive definite; (e) $E[s(w, \theta_o)] = 0$; and (f) each element of $s(w, \theta_o)$
has finite second moment.

Then

$$\sqrt{N} (\hat{\theta} - \theta_o) \overset{d}{\to} \mathrm{Normal}(0, A_o^{-1} B_o A_o^{-1}), \tag{12.18}$$

where

$$A_o \equiv E[H(w, \theta_o)] \tag{12.19}$$

and

$$B_o \equiv E[s(w, \theta_o) s(w, \theta_o)'] = \mathrm{Var}[s(w, \theta_o)]. \tag{12.20}$$

Thus,

$$\mathrm{Avar}(\hat{\theta}) = A_o^{-1} B_o A_o^{-1}/N. \tag{12.21}$$


Theorem 12.3 implies asymptotic normality of most of the estimators we study in
the remainder of the book. A leading example that is not covered by Theorem 12.3 is
the LAD estimator. Even if $m(x, \theta)$ is twice continuously differentiable in $\theta$, the
objective function for each $i$, $q(w_i, \theta) = |y_i - m(x_i, \theta)|$, is not twice continuously
differentiable because the absolute value function is nondifferentiable at zero. By itself,
nondifferentiability at zero is a minor nuisance. More important, by any reasonable
definition, the Hessian of the LAD objective function is the zero matrix in the leading
case of a linear conditional median function, and this feature violates assumption d
of Theorem 12.3. It turns out that the LAD estimator is generally $\sqrt{N}$-asymptotically
normal, but Theorem 12.3 cannot be applied. We discuss the asymptotic distribution
of LAD and other quantile estimators in Section 12.10.

A key component of Theorem 12.3 is that the score evaluated at $\theta_o$ has expected
value zero. In many applications, including NLS, we can show this result directly.
But it is also useful to know that it holds in the abstract M-estimation framework, at
least if we can interchange the expectation and the derivative. To see this point, note
that, if $\theta_o$ is in the interior of $\Theta$, and $E[q(w, \theta)]$ is differentiable for $\theta \in \mathrm{int}\,\Theta$, then

$$\nabla_\theta E[q(w, \theta)] \big|_{\theta = \theta_o} = 0. \tag{12.22}$$

Now, if the derivative and expectations operator can be interchanged (which is the
case quite generally), then equation (12.22) implies

$$E[\nabla_\theta q(w, \theta_o)] = E[s(w, \theta_o)] = 0. \tag{12.23}$$

A similar argument shows that, in general, $E[H(w, \theta_o)]$ is positive semidefinite. If $\theta_o$
is identified, $E[H(w, \theta_o)]$ is positive definite.

For the remainder of this chapter, it is convenient to divide the original NLS
objective function by two:

$$q(w, \theta) = [y - m(x, \theta)]^2/2. \tag{12.24}$$

The score of equation (12.24) can be written as

$$s(w, \theta) = -\nabla_\theta m(x, \theta)' [y - m(x, \theta)], \tag{12.25}$$

where $\nabla_\theta m(x, \theta)$ is the $1 \times P$ gradient of $m(x, \theta)$, and therefore $\nabla_\theta m(x, \theta)'$ is $P \times 1$.
We can show directly that this expression has an expected value of zero at $\theta = \theta_o$ by
showing that the expected value of $s(w, \theta_o)$ conditional on $x$ is zero:

$$E[s(w, \theta_o) \mid x] = -\nabla_\theta m(x, \theta_o)' [E(y \mid x) - m(x, \theta_o)] = 0. \tag{12.26}$$

The variance of $s(w, \theta_o)$ is

$$B_o \equiv E[s(w, \theta_o) s(w, \theta_o)'] = E[u^2 \nabla_\theta m(x, \theta_o)' \nabla_\theta m(x, \theta_o)], \tag{12.27}$$

where the error $u \equiv y - m(x, \theta_o)$ is the difference between $y$ and $E(y \mid x)$.
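The zero-mean property of the score at $\theta_o$ can be checked by simulation. The sketch below assumes a scalar exponential model $m(x, \theta) = \exp(\theta x)$ with $\theta_o = 1$, which is a simplification of the exponential regression function discussed earlier; the distributions, seed, and the alternative value $\theta = 0.5$ are illustrative assumptions.

```python
import math
import random

def score(x, y, theta):
    """NLS score (12.25) for scalar theta and m(x, theta) = exp(theta * x)."""
    mx = math.exp(theta * x)
    grad = x * mx                  # derivative of m with respect to theta
    return -grad * (y - mx)

THETA_O = 1.0                      # known here only because we simulate
rng = random.Random(99)
n = 20000
xs = [rng.uniform(0.0, 1.0) for _ in range(n)]
# Mean-one multiplicative noise: E(y | x) = exp(THETA_O * x), heteroskedastic u.
ys = [math.exp(THETA_O * x) * rng.uniform(0.5, 1.5) for x in xs]

def mean_score(theta):
    """Sample average of the score evaluated at theta."""
    return sum(score(x, y, theta) for x, y in zip(xs, ys)) / n

at_truth = mean_score(THETA_O)
away = mean_score(0.5)

print("average score at theta_o:", at_truth)
print("average score at 0.5    :", away)
```

Under this design the population mean of the score at $\theta = 0.5$ is about $-0.44$, so the contrast with the near-zero average at $\theta_o$ should be visible even in moderate samples.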
The Hessian of $q(w, \theta)$ is

$$H(w, \theta) = \nabla_\theta m(x, \theta)' \nabla_\theta m(x, \theta) - \nabla_\theta^2 m(x, \theta) [y - m(x, \theta)], \tag{12.28}$$

where $\nabla_\theta^2 m(x, \theta)$ is the $P \times P$ Hessian of $m(x, \theta)$ with respect to $\theta$. To find the
expected value of $H(w, \theta)$ at $\theta = \theta_o$, we first find the expectation conditional on $x$.
When evaluated at $\theta_o$, the second term in equation (12.28) is $\nabla_\theta^2 m(x, \theta_o) u$, and it
therefore has a zero mean conditional on $x$ (since $E(u \mid x) = 0$). Therefore,

$$E[H(w, \theta_o) \mid x] = \nabla_\theta m(x, \theta_o)' \nabla_\theta m(x, \theta_o). \tag{12.29}$$

Taking the expected value of equation (12.29) over the distribution of $x$ gives

$$A_o = E[\nabla_\theta m(x, \theta_o)' \nabla_\theta m(x, \theta_o)]. \tag{12.30}$$

This matrix plays a fundamental role in nonlinear regression. When $\theta_o$ is identified,
$A_o$ is generally positive definite. In the linear case $m(x, \theta) = x\theta$, $A_o = E(x'x)$. In the
exponential case $m(x, \theta) = \exp(x\theta)$, $A_o = E[\exp(2x\theta_o) x'x]$, which is generally positive
definite whenever $E(x'x)$ is. In the example $m(x, \theta) = \theta_1 + \theta_2 x_2 + \theta_3 x_3^{\theta_4}$ with
$\theta_{o3} = 0$, it is easy to show that matrix (12.30) has rank less than four.

For nonlinear regression, $A_o$ and $B_o$ appear to be similar in that they both depend 
on $\nabla_\theta m(x,\theta_o)'\nabla_\theta m(x,\theta_o)$. Generally, though, there is no simple relationship between 
$A_o$ and $B_o$ because the latter depends on the distribution of $u^2$, the squared population 
error. In Section 12.5 we will show that a homoskedasticity assumption implies 
that $B_o$ is proportional to $A_o$. 

The previous analysis of nonlinear regression assumes that the conditional mean 
function is correctly specified. If we drop this assumption, then we cannot simplify 
the expected Hessian as in equation (12.30). White (1981, 1994) studies the properties 
of NLS when the mean function is misspecified. Under weak conditions, the NLS 
estimator $\hat\theta$ converges in probability to a vector $\theta^*$, where $\theta^*$ provides the best 
mean square approximation to the actual regression function, $E(y_i\,|\,x_i = x)$. More 
precisely, $E\{[E(y_i\,|\,x_i) - m(x_i,\theta^*)]^2\} \le E\{[E(y_i\,|\,x_i) - m(x_i,\theta)]^2\}$ for $\theta \in \Theta$. The 
asymptotic normality result in equation (12.18) still holds (with the obvious notational 
changes). But the second term in equation (12.28), evaluated at $\theta^*$, is no longer 
guaranteed to have zero mean: $\nabla_\theta^2 m(x_i,\theta^*)$ is generally correlated with $y_i - m(x_i,\theta^*)$, 
because $E(y_i\,|\,x_i) \ne m(x_i,\theta^*)$. We have focused on the case of correctly specified 
conditional mean functions, but one should be aware that the assumption has impli- 
cations for estimating the asymptotic variance, a topic we return to in Section 12.5. 


12.4 Two-Step M-Estimators 


Sometimes applications of M-estimators involve a first-stage estimation (an example 
is OLS with generated regressors, as in Chapter 6). Let $\hat\gamma$ be a preliminary estimator, 
usually based on the random sample $\{w_i : i = 1, 2, \ldots, N\}$. Where this estimator 
comes from must be vague for the moment. 

A two-step M-estimator $\hat\theta$ of $\theta_o$ solves the problem 

$$\min_{\theta \in \Theta} \sum_{i=1}^{N} q(w_i, \theta; \hat\gamma), \tag{12.31}$$

where $q$ is now defined on $\mathcal{W} \times \Theta \times \Gamma$, and $\Gamma$ is a subset of $\mathbb{R}^J$. We will see several 
examples of two-step M-estimators in Chapter 13 and the applications in Part IV. 
An example of a two-step M-estimator is the weighted nonlinear least squares (WNLS) 
estimator, where the weights are estimated in a first stage. The WNLS estimator 
solves 

$$\min_{\theta \in \Theta}\; (1/2) \sum_{i=1}^{N} [y_i - m(x_i,\theta)]^2 / h(x_i, \hat\gamma), \tag{12.32}$$

where the weighting function, $h(x,\gamma)$, depends on the explanatory variables and a 
parameter vector. As with NLS, $m(x,\theta)$ is a model of $E(y\,|\,x)$. The function $h(x,\gamma)$ is 
chosen to be a model of $\mathrm{Var}(y\,|\,x)$. The estimator $\hat\gamma$ comes from a problem used to 
estimate the conditional variance. We list the key assumptions needed for WNLS to 
have desirable properties here, but several of the derivations are left for the problems. 


ASSUMPTION WNLS.1: Same as Assumption NLS.1. 


12.4.1 Consistency 


For the general two-step M-estimator, when will $\hat\theta$ be consistent for $\theta_o$? In practice, 
the important condition is the identification assumption. To state the identification 
condition, we need to know about the asymptotic behavior of $\hat\gamma$. A general assumption 
is that $\hat\gamma \xrightarrow{p} \gamma^*$, where $\gamma^*$ is some element in $\Gamma$. We label this value $\gamma^*$ to allow for 
the possibility that $\hat\gamma$ does not converge to a parameter indexing some interesting 
feature of the distribution of $w$. In some cases, the plim of $\hat\gamma$ will be of direct interest. 
In the weighted regression case, if we assume that $h(x,\gamma)$ is a correctly specified model 
for $\mathrm{Var}(y\,|\,x)$, then it is possible to choose an estimator such that $\hat\gamma \xrightarrow{p} \gamma_o$, where 
$\mathrm{Var}(y\,|\,x) = h(x,\gamma_o)$. (For an example, see Problem 12.2.) If the variance model is 
misspecified, plim $\hat\gamma$ is generally well defined, but $\mathrm{Var}(y\,|\,x) \ne h(x,\gamma^*)$; it is for this 
reason that we use the notation $\gamma^*$. 

The identification condition for the two-step M-estimator is 

$$E[q(w,\theta_o;\gamma^*)] < E[q(w,\theta;\gamma^*)], \quad \text{all } \theta \in \Theta,\ \theta \ne \theta_o.$$


The consistency argument is essentially the same as that underlying Theorem 12.2. If 
$q(w_i,\theta;\gamma)$ satisfies the UWLLN over $\Theta \times \Gamma$, then expression (12.31) can be shown to 
converge to $E[q(w,\theta;\gamma^*)]$ uniformly over $\Theta$. Along with identification, this result can 
be shown to imply consistency of $\hat\theta$ for $\theta_o$. 

In some applications of two-step M-estimation, identification of $\theta_o$ holds for any 
$\gamma \in \Gamma$. This result can be shown for the WNLS estimator (see Problem 12.4). It is for 
this reason that WNLS is still consistent even if the function $h(x,\gamma)$ is not correctly 
specified for $\mathrm{Var}(y\,|\,x)$. The weakest version of the identification assumption for 
WNLS is the following: 


ASSUMPTION WNLS.2: $E\{[m(x,\theta_o) - m(x,\theta)]^2/h(x,\gamma^*)\} > 0$, all $\theta \in \Theta$, $\theta \ne \theta_o$, 
where $\gamma^* = \operatorname{plim} \hat\gamma$. 


As with the case of NLS, we know that weak inequality holds in Assumption 
WNLS.2 under Assumption WNLS.1. The strict inequality in Assumption WNLS.2 
puts restrictions on the distribution of x and the functional forms of m and h. 

In other cases, including several two-step maximum likelihood estimators we encounter 
in Part IV, the identification condition for $\theta_o$ holds only for $\gamma = \gamma^* = \gamma_o$, 
where $\gamma_o$ also indexes some feature of the distribution of $w$. 


12.4.2 Asymptotic Normality 


With the two-step M-estimator, there are two cases worth distinguishing. The first 
occurs when the asymptotic variance of $\sqrt{N}(\hat\theta - \theta_o)$ does not depend on the asymptotic 
variance of $\sqrt{N}(\hat\gamma - \gamma^*)$, and the second occurs when the asymptotic variance of 
$\sqrt{N}(\hat\theta - \theta_o)$ should be adjusted to account for the first-stage estimation of $\gamma^*$. We 
first derive conditions under which we can ignore the first-stage estimation error. 

Using arguments similar to those in Section 12.3, it can be shown that, under 
standard regularity conditions, 

$$\sqrt{N}(\hat\theta - \theta_o) = A_o^{-1}\left(-N^{-1/2}\sum_{i=1}^{N} s_i(\theta_o;\hat\gamma)\right) + o_p(1), \tag{12.33}$$

where now $A_o = E[H(w,\theta_o;\gamma^*)]$. In obtaining the score and the Hessian, we take 
derivatives only with respect to $\theta$; $\gamma^*$ simply appears as an extra argument. Now, if 

$$N^{-1/2}\sum_{i=1}^{N} s_i(\theta_o;\hat\gamma) = N^{-1/2}\sum_{i=1}^{N} s_i(\theta_o;\gamma^*) + o_p(1), \tag{12.34}$$

then $\sqrt{N}(\hat\theta - \theta_o)$ behaves the same asymptotically whether we used $\hat\gamma$ or its plim in 
defining the M-estimator. 


When does equation (12.34) hold? Assuming that $\sqrt{N}(\hat\gamma - \gamma^*) = O_p(1)$, which is 
standard, a mean value expansion similar to the one in Section 12.3 gives 

$$N^{-1/2}\sum_{i=1}^{N} s_i(\theta_o;\hat\gamma) = N^{-1/2}\sum_{i=1}^{N} s_i(\theta_o;\gamma^*) + F_o\sqrt{N}(\hat\gamma - \gamma^*) + o_p(1), \tag{12.35}$$

where $F_o$ is the $P \times J$ matrix 

$$F_o \equiv E[\nabla_\gamma s(w,\theta_o;\gamma^*)]. \tag{12.36}$$

(Remember, $J$ is the dimension of $\gamma$.) Therefore, if 

$$E[\nabla_\gamma s(w,\theta_o;\gamma^*)] = 0, \tag{12.37}$$

then equation (12.34) holds, and the asymptotic variance of the two-step M-estimator 
is the same as if $\gamma^*$ were plugged in. In other words, under assumption (12.37), we 
conclude that equation (12.18) holds, where $A_o$ and $B_o$ are given in expressions 
(12.19) and (12.20), respectively, except that $\gamma^*$ appears as an argument in the score 
and Hessian. For deriving the asymptotic distribution of $\sqrt{N}(\hat\theta - \theta_o)$, we can ignore 
the fact that $\hat\gamma$ was obtained in a first-stage estimation. 

One case where assumption (12.37) holds is weighted nonlinear least squares, 
something you are asked to show in Problem 12.4. Naturally, we must assume that 
the conditional mean is correctly specified, but, interestingly, assumption (12.37) 
holds whether or not the conditional variance is correctly specified. 
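
A brief sketch of why assumption (12.37) holds for WNLS may help. It is only an outline of the argument requested in Problem 12.4, and it assumes that $h(x,\gamma)$ is differentiable in $\gamma$; the symbol $\nabla_\gamma h(x,\gamma)$ for its $1 \times J$ gradient is notation introduced here for illustration. Starting from the objective function in (12.32):

```latex
\begin{align*}
s(w,\theta;\gamma) &= -\nabla_\theta m(x,\theta)'\,[y - m(x,\theta)]/h(x,\gamma), \\
\nabla_\gamma s(w,\theta_o;\gamma^*) &= \nabla_\theta m(x,\theta_o)'\,u\,\nabla_\gamma h(x,\gamma^*)/[h(x,\gamma^*)]^2, \\
E[\nabla_\gamma s(w,\theta_o;\gamma^*)\,|\,x] &= \nabla_\theta m(x,\theta_o)'\,E(u\,|\,x)\,\nabla_\gamma h(x,\gamma^*)/[h(x,\gamma^*)]^2 = 0.
\end{align*}
```

The last line uses only $E(u\,|\,x) = 0$, that is, correct specification of the conditional mean; nothing in the argument requires $h(x,\gamma^*)$ to equal $\mathrm{Var}(y\,|\,x)$.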

There are many problems for which assumption (12.37) does not hold, including 
some of the methods for correcting for endogeneity in probit and Tobit models in Part 
IV. In Chapter 21 we will see that two-step methods for correcting sample selection 
bias are two-step M-estimators, but assumption (12.37) fails. In such cases we need to 
make an adjustment to the asymptotic variance of $\sqrt{N}(\hat\theta - \theta_o)$. The adjustment is 
easily obtained from equation (12.35), once we have a first-order representation for 
$\sqrt{N}(\hat\gamma - \gamma^*)$. We assume that 

$$\sqrt{N}(\hat\gamma - \gamma^*) = N^{-1/2}\sum_{i=1}^{N} r_i(\gamma^*) + o_p(1), \tag{12.38}$$

where $r_i(\gamma^*)$ is a $J \times 1$ vector with $E[r_i(\gamma^*)] = 0$ (in practice, $r_i$ depends on parameters 
other than $\gamma^*$, but we suppress those here for simplicity). Therefore, $\hat\gamma$ could itself be 
an M-estimator or, as we will see in Chapter 14, a generalized method of moments 
estimator. In fact, every estimator considered in this book has a representation as in 
equation (12.38). 

Now we can write 

$$\sqrt{N}(\hat\theta - \theta_o) = A_o^{-1}\left\{N^{-1/2}\sum_{i=1}^{N}[-g_i(\theta_o;\gamma^*)]\right\} + o_p(1), \tag{12.39}$$

where $g_i(\theta_o;\gamma^*) \equiv s_i(\theta_o;\gamma^*) + F_o r_i(\gamma^*)$. Since $g_i(\theta_o;\gamma^*)$ has zero mean, the standardized 
partial sum in equation (12.39) can be assumed to satisfy the central limit theorem. 
Define the $P \times P$ matrix 

$$D_o \equiv E[g_i(\theta_o;\gamma^*)g_i(\theta_o;\gamma^*)'] = \mathrm{Var}[g_i(\theta_o;\gamma^*)]. \tag{12.40}$$

Then 

$$\mathrm{Avar}\,\sqrt{N}(\hat\theta - \theta_o) = A_o^{-1}D_oA_o^{-1}. \tag{12.41}$$

We will discuss estimation of this matrix in the next section. 

Sometimes it is informative to compare the correct asymptotic variance in (12.41) 
with the asymptotic variance one would obtain by ignoring the sampling error in $\hat\gamma$; 
that is, by using the incorrect formula in equation (12.18). (If $F_o = 0$, both formulas 
are the same.) Most often, the concern is that ignoring estimation of $\gamma^*$ leads one to 
be too optimistic about the precision in $\hat\theta$, which happens when $D_o - B_o$ is positive 
semidefinite (p.s.d.). One case where $D_o - B_o$ is unambiguously p.s.d. (but not zero) 
is when $F_o \ne 0$ and the scores from the first- and second-step estimation are uncorrelated: 
$E[s_i(\theta_o;\gamma^*)r_i(\gamma^*)'] = 0$. Then, 

$$D_o = E[s_i(\theta_o;\gamma^*)s_i(\theta_o;\gamma^*)'] + F_oE[r_i(\gamma^*)r_i(\gamma^*)']F_o' = B_o + F_oE[r_i(\gamma^*)r_i(\gamma^*)']F_o',$$

and the last term is p.s.d. In other cases, $D_o - B_o$ is indefinite, in which case it is not 
true that the correct asymptotic variances are all larger than the incorrect ones. Perhaps 
surprisingly, it is also possible that $D_o - B_o$ is negative semidefinite. An immediate 
implication is that it is possible that estimating $\gamma^*$, rather than knowing $\gamma^*$, can 
actually lead to a more precise estimator of $\theta_o$. Situations where estimation in a first 
stage reduces the asymptotic variance in a second stage are special, and when it 
happens, the estimator of $\gamma^*$ is usually from a correctly specified maximum likelihood 
procedure (in which case we would use the notation $\gamma_o$ rather than $\gamma^*$). Therefore, 
we postpone further discussion until the next chapter, where we explicitly cover 
maximum likelihood estimation. 


12.5 Estimating the Asymptotic Variance 


12.5.1 Estimation without Nuisance Parameters 


We first consider estimating the asymptotic variance of $\hat\theta$ in the case where there are 
no nuisance parameters. This task requires consistently estimating the matrices $A_o$ 
and $B_o$. One thought is to solve for the expected values of $H(w,\theta_o)$ and 
$s(w,\theta_o)s(w,\theta_o)'$ over the distribution of $w$, and then to plug in $\hat\theta$ for $\theta_o$. When we have 
completely specified the distribution of $w$, obtaining closed-form expressions for $A_o$ 
and $B_o$ is, in principle, possible. However, except in simple cases, it would be difficult. 
More important, we rarely specify the entire distribution of $w$. Even in a maximum 
likelihood setting, $w$ is almost always partitioned into two parts: a set of endogenous 
variables, $y$, and conditioning variables, $x$. Rarely do we wish to specify the distribution 
of $x$, and so the expected values needed to obtain $A_o$ and $B_o$ are not available. 

We can always estimate $A_o$ consistently by taking away the expectation and 
replacing $\theta_o$ with $\hat\theta$. Under regularity conditions that ensure uniform convergence of the 
Hessian, the estimator 

$$N^{-1}\sum_{i=1}^{N} H(w_i,\hat\theta) \equiv N^{-1}\sum_{i=1}^{N}\hat H_i \tag{12.42}$$

is consistent for $A_o$, by Lemma 12.1. The advantage of the estimator (12.42) is that it 
is always available in problems with a twice continuously differentiable objective 
function. This includes cases where $\theta_o$ does not index a feature of an unconditional or 
conditional distribution (such as nonlinear regression with the conditional mean 
function misspecified). Its main drawback is computational: it requires calculation of 
second derivatives, which is a nontrivial task for certain estimation problems. 

Because we assume $\hat\theta$ solves the minimization problem in equation (12.8), and $\hat\theta$ is 
in the interior of $\Theta$, we know that the Hessian evaluated at $\hat\theta$ is at least p.s.d. If the 
model is well specified in the sense that $\theta_o$ is identified, then the estimator in (12.42) 
will actually be positive definite with high probability because $E[H(w_i,\theta_o)]$ is positive 
definite. In some cases, the objective function is strictly convex on $\Theta$, in which case 
$\sum_{i=1}^{N} H(w_i,\theta)$ is positive definite for all $\theta \in \Theta$. 

In many econometric applications, more structure is available that allows a different 
estimator. Suppose we can partition $w$ into $x$ and $y$, and that $\theta_o$ indexes some 
feature of the distribution of $y$ given $x$ (such as the conditional mean or, in the case of 
maximum likelihood, the conditional distribution). Define 

$$A(x,\theta_o) \equiv E[H(w,\theta_o)\,|\,x]. \tag{12.43}$$

While $H(w,\theta_o)$ is generally a function of $x$ and $y$, $A(x,\theta_o)$ is a function only of $x$. By 
the law of iterated expectations, $E[A(x,\theta_o)] = E[H(w,\theta_o)] = A_o$. From Lemma 12.1 
and standard regularity conditions it follows that 

$$N^{-1}\sum_{i=1}^{N} A(x_i,\hat\theta) \equiv N^{-1}\sum_{i=1}^{N}\hat A_i \xrightarrow{p} A_o. \tag{12.44}$$

The estimator (12.44) of $A_o$ is useful in cases where $E[H(w,\theta_o)\,|\,x]$ can be obtained in 
closed form or is easily approximated. In some leading cases, including NLS and 
certain maximum likelihood problems, $A(x,\theta_o)$ depends only on the first derivatives 
of the conditional mean function. 

When the estimator (12.44) is available, it is usually the case that $\theta_o$ actually minimizes 
$E[q(w,\theta)\,|\,x]$ for any value of $x$; from equation (12.4) this is easily seen to be 
the case for NLS with a correctly specified conditional mean function. Under 
assumptions that allow the interchange of derivative and expectation, this result 
implies that $A(x,\theta_o)$ is p.s.d. The expected value of $A(x,\theta_o)$ over the distribution of $x$ 
is positive definite provided $\theta_o$ is identified. Therefore, the estimator (12.44) is usually 
positive definite in the sample. 

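Obtaining a p.s.d. estimator of $B_o$ is straightforward. By Lemma 12.1, under 
standard regularity conditions we have 

$$N^{-1}\sum_{i=1}^{N} s(w_i,\hat\theta)s(w_i,\hat\theta)' \equiv N^{-1}\sum_{i=1}^{N}\hat s_i\hat s_i' \xrightarrow{p} B_o. \tag{12.45}$$

Combining the estimator (12.45) with the consistent estimators for $A_o$, we can 
consistently estimate $\mathrm{Avar}\,\sqrt{N}(\hat\theta - \theta_o)$ by 

$$\widehat{\mathrm{Avar}}\,\sqrt{N}(\hat\theta - \theta_o) = \hat A^{-1}\hat B\hat A^{-1}, \tag{12.46}$$

where $\hat A$ is one of the estimators (12.42) or (12.44) and $\hat B$ is the estimator in (12.45). 
The asymptotic standard errors are obtained from the matrix 

$$\hat V \equiv \widehat{\mathrm{Avar}}(\hat\theta) = \hat A^{-1}\hat B\hat A^{-1}/N, \tag{12.47}$$

which can be expressed as 

$$\left(\sum_{i=1}^{N}\hat H_i\right)^{-1}\left(\sum_{i=1}^{N}\hat s_i\hat s_i'\right)\left(\sum_{i=1}^{N}\hat H_i\right)^{-1} \tag{12.48}$$

or 

$$\left(\sum_{i=1}^{N}\hat A_i\right)^{-1}\left(\sum_{i=1}^{N}\hat s_i\hat s_i'\right)\left(\sum_{i=1}^{N}\hat A_i\right)^{-1}, \tag{12.49}$$

depending on the estimator used for $A_o$. Expressions (12.48) and (12.49) are both at 
least p.s.d. when they are well defined. 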

The variance matrix estimators in (12.48) and (12.49) are examples of a Huber-White 
sandwich estimator, after Huber (1967) and White (1982a). When they differ 
(due to $H(w,\theta_o) \ne A(x,\theta_o)$), the estimator in equation (12.49) is usually valid only 
when some feature of the conditional distribution $D(y_i\,|\,x_i)$ is correctly specified (because 
otherwise the calculation of $E[H(w,\theta_o)\,|\,x]$ would be incorrect). Sometimes it is 
useful to make a distinction by calling the estimator in (12.49) a semirobust variance 
matrix estimator. The variance matrix estimator that requires the fewest assumptions 
to produce valid inference is in (12.48) (and so, for emphasis, it might be called a fully 
robust variance matrix estimator). 

In the case of NLS, the estimator of $A_o$ in equation (12.44) is always available 
when $E(y_i\,|\,x_i) = m(x_i,\theta_o)$, and it is usually used if we think the mean function is 
correctly specified: 

$$\sum_{i=1}^{N}\hat A_i = \sum_{i=1}^{N}\nabla_\theta\hat m_i'\nabla_\theta\hat m_i,$$

where $\nabla_\theta\hat m_i \equiv \nabla_\theta m(x_i,\hat\theta)$ for every observation $i$. Also, the estimated score for NLS 
can be written as 

$$\hat s_i = -\nabla_\theta\hat m_i'[y_i - m(x_i,\hat\theta)] = -\nabla_\theta\hat m_i'\hat u_i, \tag{12.50}$$

where the nonlinear least squares residuals, $\hat u_i$, are defined as 

$$\hat u_i \equiv y_i - m(x_i,\hat\theta). \tag{12.51}$$

The estimated asymptotic variance of the NLS estimator is 

$$\widehat{\mathrm{Avar}}(\hat\theta) = \left(\sum_{i=1}^{N}\nabla_\theta\hat m_i'\nabla_\theta\hat m_i\right)^{-1}\left(\sum_{i=1}^{N}\hat u_i^2\,\nabla_\theta\hat m_i'\nabla_\theta\hat m_i\right)\left(\sum_{i=1}^{N}\nabla_\theta\hat m_i'\nabla_\theta\hat m_i\right)^{-1}. \tag{12.52}$$

This is called the heteroskedasticity-robust variance matrix estimator for NLS 
because it places no restrictions on $\mathrm{Var}(y\,|\,x)$. It was first proposed by White (1980a). 
(Sometimes the expression is multiplied by $N/(N - P)$ as a degrees-of-freedom adjustment, 
where $P$ is the dimension of $\theta$.) As always, the asymptotic standard error of 
each element of $\hat\theta$ is the square root of the appropriate diagonal element of matrix 
(12.52). 

As a specific example, suppose that $m(x,\theta) = \exp(x\theta)$. Then $\nabla_\theta\hat m_i'\nabla_\theta\hat m_i = 
\exp(2x_i\hat\theta)x_i'x_i$, which has dimension $K \times K$. We can plug this equation into expression 
(12.52) along with $\hat u_i = y_i - \exp(x_i\hat\theta)$. 
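
The exponential example can be sketched numerically. The following is a minimal illustration of expression (12.52), assuming the NLS estimate has already been computed elsewhere; the function name `nls_robust_avar_exp` and its argument layout are hypothetical choices made here, not part of the text.

```python
import numpy as np

def nls_robust_avar_exp(y, X, theta_hat):
    """Heteroskedasticity-robust variance (12.52) for m(x, theta) = exp(x theta).

    y: (N,) responses; X: (N, K) regressors; theta_hat: (K,) NLS estimate.
    Names and interface are illustrative assumptions.
    """
    m_hat = np.exp(X @ theta_hat)            # fitted conditional means
    grad = m_hat[:, None] * X                # row i is the gradient exp(x_i theta) x_i
    u_hat = y - m_hat                        # NLS residuals, eq. (12.51)
    bread = grad.T @ grad                    # sum of gradient outer products
    meat = (grad * (u_hat ** 2)[:, None]).T @ grad  # outer products weighted by u_hat^2
    bread_inv = np.linalg.inv(bread)
    return bread_inv @ meat @ bread_inv      # sandwich form of (12.52)
```

Asymptotic standard errors would then be obtained as `np.sqrt(np.diag(V))` for the returned matrix `V`.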

Using the previous terminology, the estimator in (12.52) is a semirobust variance 
matrix estimator because it assumes that the conditional mean is correctly specified. 
The sense in which (12.52) is robust is that it is valid without any assumptions on 
$\mathrm{Var}(y_i\,|\,x_i)$, but it is not valid if $E(y_i\,|\,x_i)$ is misspecified. If we allow misspecification 
of $m(x,\theta)$, as in White (1981), the formula (12.52) is no longer valid. 
Instead, the summands in the outer terms of (12.52) should be replaced with 
$\nabla_\theta m(x_i,\hat\theta)'\nabla_\theta m(x_i,\hat\theta) - \nabla_\theta^2 m(x_i,\hat\theta)[y_i - m(x_i,\hat\theta)]$, which is simply the $P \times P$ Hessian 
of $[y_i - m(x_i,\theta)]^2/2$ evaluated at $\hat\theta$. In most applications of NLS (and in standard 
software packages) the conditional mean is assumed to be correctly specified, and 
equation (12.52) is used as the robust variance matrix estimator. 

In many contexts, including NLS and certain quasi-likelihood methods, the 
asymptotic variance estimator can be simplified under additional assumptions. For 
our purposes, we state the assumption as follows: For some $\sigma_o^2 > 0$, 

$$E[s(w,\theta_o)s(w,\theta_o)'] = \sigma_o^2E[H(w,\theta_o)]. \tag{12.53}$$

This assumption simply says that the expected outer product of the score, evaluated 
at $\theta_o$, is proportional to the expected value of the Hessian (evaluated at $\theta_o$): 
$B_o = \sigma_o^2A_o$. Shortly we will provide an assumption under which assumption (12.53) holds 
for NLS. In the next chapter we will show that assumption (12.53) holds for $\sigma_o^2 = 1$ in 
the context of maximum likelihood with a correctly specified conditional density. For 
reasons we will see in Chapter 13, we refer to assumption (12.53) as the generalized 
information matrix equality (GIME). 


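LEMMA 12.2: Under regularity conditions of the type contained in Theorem 12.3 and 
assumption (12.53), $\mathrm{Avar}(\hat\theta) = \sigma_o^2A_o^{-1}/N$. Therefore, under assumption (12.53), the 
asymptotic variance of $\hat\theta$ can be estimated as 

$$\hat V = \hat\sigma^2\left(\sum_{i=1}^{N}\hat H_i\right)^{-1} \tag{12.54}$$

or 

$$\hat V = \hat\sigma^2\left(\sum_{i=1}^{N}\hat A_i\right)^{-1}, \tag{12.55}$$

where $\hat H_i$ and $\hat A_i$ are defined as before, and $\hat\sigma^2 \xrightarrow{p} \sigma_o^2$. 


In the case of nonlinear regression, the parameter $\sigma_o^2$ is the variance of $y$ given $x$, or 
equivalently $\mathrm{Var}(u\,|\,x)$, under homoskedasticity: 


ASSUMPTION NLS.3: $\mathrm{Var}(y\,|\,x) = \mathrm{Var}(u\,|\,x) = \sigma_o^2$. 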


Under Assumption NLS.3, we can show that assumption (12.53) holds with 
$\sigma_o^2 = \mathrm{Var}(y\,|\,x)$. First, since $s(w,\theta_o)s(w,\theta_o)' = u^2\,\nabla_\theta m(x,\theta_o)'\nabla_\theta m(x,\theta_o)$, it follows that 

$$\begin{aligned}
E[s(w,\theta_o)s(w,\theta_o)'\,|\,x] &= E(u^2\,|\,x)\,\nabla_\theta m(x,\theta_o)'\nabla_\theta m(x,\theta_o) \\
&= \sigma_o^2\,\nabla_\theta m(x,\theta_o)'\nabla_\theta m(x,\theta_o)
\end{aligned} \tag{12.56}$$

under Assumptions NLS.1 and NLS.3. Taking the expected value with respect to $x$ 
gives equation (12.53). 

Under Assumption NLS.3, a simplified estimator of the asymptotic variance of the 
NLS estimator exists from equation (12.55). Let 

$$\hat\sigma^2 = \frac{1}{N - P}\sum_{i=1}^{N}\hat u_i^2 = \mathrm{SSR}/(N - P), \tag{12.57}$$

where the $\hat u_i$ are the NLS residuals (12.51) and SSR is the sum of squared NLS 
residuals. Using Lemma 12.1, $\hat\sigma^2$ can be shown to be consistent very generally. The 
subtraction of $P$ in the denominator of equation (12.57) is an adjustment that is 
thought to improve the small sample properties of $\hat\sigma^2$. 

Under Assumptions NLS.1-NLS.3, the asymptotic variance of the NLS estimator 
is estimated as 

$$\hat\sigma^2\left(\sum_{i=1}^{N}\nabla_\theta\hat m_i'\nabla_\theta\hat m_i\right)^{-1}. \tag{12.58}$$

This is the default asymptotic variance estimator for NLS, but it is valid only 
under homoskedasticity; the estimator (12.52) is valid with or without Assumption 
NLS.3. For an exponential regression function, expression (12.58) becomes 
$\hat\sigma^2\left(\sum_{i=1}^{N}\exp(2x_i\hat\theta)x_i'x_i\right)^{-1}$. (Remember, if we want to allow Assumption NLS.1 to 
fail, we should use expression [12.48].) 
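
As a rough sketch of expressions (12.57) and (12.58), the function below takes the $N \times P$ matrix of estimated gradients and the NLS residuals as given. The name `nls_homoskedastic_avar` and the choice to pass the gradient matrix directly are assumptions made for illustration.

```python
import numpy as np

def nls_homoskedastic_avar(grad, u_hat):
    """Variance estimator (12.58), valid only under Assumption NLS.3.

    grad: (N, P) array whose row i is the estimated gradient of m at x_i.
    u_hat: (N,) NLS residuals.
    """
    n_obs, n_par = grad.shape
    sigma2_hat = (u_hat @ u_hat) / (n_obs - n_par)      # SSR / (N - P), eq. (12.57)
    return sigma2_hat * np.linalg.inv(grad.T @ grad)    # eq. (12.58)
```

Comparing its diagonal with that of the robust matrix (12.52) gives an informal sense of how much heteroskedasticity matters in a given sample.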


12.5.2 Adjustments for Two-Step Estimation 


In the case of the two-step M-estimator, we may or may not need to adjust the 
asymptotic variance. If assumption (12.37) holds, estimation is very simple. The most 
general estimators are expressions (12.48) and (12.49), where $\hat s_i$, $\hat H_i$, and $\hat A_i$ depend on 
$\hat\gamma$, but we only compute derivatives with respect to $\theta$. 

In some cases under assumption (12.37), the analogue of assumption (12.53) holds 
(with $\gamma_o = \operatorname{plim}\hat\gamma$ appearing in $H$ and $s$). If so, the simpler estimators (12.54) and 
(12.55) are available. In Problem 12.4 you are asked to show this result for weighted 
NLS when $\mathrm{Var}(y\,|\,x) = \sigma_o^2h(x,\gamma_o)$ and $\gamma_o = \operatorname{plim}\hat\gamma$. The natural third assumption for 
WNLS is that the variance function is correctly specified: 


ASSUMPTION WNLS.3: For some $\gamma_o \in \Gamma$ and $\sigma_o^2 > 0$, $\mathrm{Var}(y\,|\,x) = \sigma_o^2h(x,\gamma_o)$. Further, 
$\sqrt{N}(\hat\gamma - \gamma_o) = O_p(1)$. 


Under Assumptions WNLS.1-WNLS.3, the asymptotic variance of the WNLS estimator 
is estimated as 

$$\hat\sigma^2\left(\sum_{i=1}^{N}(\nabla_\theta\hat m_i'\nabla_\theta\hat m_i)/\hat h_i\right)^{-1}, \tag{12.59}$$

where $\hat h_i = h(x_i,\hat\gamma)$ and $\hat\sigma^2$ is as in equation (12.57) except that the residual $\hat u_i$ is 
replaced with the standardized residual, $\hat u_i/\sqrt{\hat h_i}$. The sum in expression (12.59) is 
simply the outer product of the weighted gradients, $\nabla_\theta\hat m_i/\sqrt{\hat h_i}$. Thus the NLS formulas 
can be used but with all quantities weighted by $1/\sqrt{\hat h_i}$. It is important to remember 
that expression (12.59) is not valid without Assumption WNLS.3. Without 
Assumption WNLS.3, but with correct specification of the conditional mean, the 
asymptotic variance estimator of the WNLS estimator is 

$$\left(\sum_{i=1}^{N}\nabla_\theta\hat m_i'\nabla_\theta\hat m_i/\hat h_i\right)^{-1}\left(\sum_{i=1}^{N}\hat u_i^2\,\nabla_\theta\hat m_i'\nabla_\theta\hat m_i/\hat h_i^2\right)\left(\sum_{i=1}^{N}\nabla_\theta\hat m_i'\nabla_\theta\hat m_i/\hat h_i\right)^{-1}, \tag{12.60}$$

where $\hat u_i = y_i - m(x_i,\hat\theta)$ are the WNLS residuals. Notice that this estimator has the 
same structure as equation (12.52), but where the residuals are replaced with the 
standardized residuals, $\hat u_i/\sqrt{\hat h_i}$, and the gradients are replaced with the weighted 
gradients, $\nabla_\theta\hat m_i/\sqrt{\hat h_i}$. 
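
The weighting described above can be sketched as follows. This is an illustrative implementation of (12.59) and (12.60), assuming the estimated gradients, WNLS residuals, and fitted variance weights are already available; the function name `wnls_avar` and its return convention are hypothetical.

```python
import numpy as np

def wnls_avar(grad, u_hat, h_hat):
    """Return (V_wnls3, V_robust) for the WNLS estimator.

    grad: (N, P) estimated gradients of m; u_hat: (N,) WNLS residuals;
    h_hat: (N,) positive fitted weights h(x_i, gamma_hat).
    V_wnls3 follows (12.59) and requires Assumption WNLS.3;
    V_robust follows (12.60) and only requires a correct conditional mean.
    """
    n_obs, n_par = grad.shape
    root_h = np.sqrt(h_hat)
    w_grad = grad / root_h[:, None]          # weighted gradients
    w_res = u_hat / root_h                   # standardized residuals
    a_sum = w_grad.T @ w_grad                # sum of grad_i' grad_i / h_i
    a_inv = np.linalg.inv(a_sum)
    sigma2_hat = (w_res @ w_res) / (n_obs - n_par)
    v_wnls3 = sigma2_hat * a_inv
    meat = (w_grad * (w_res ** 2)[:, None]).T @ w_grad  # sum of u_i^2 grad_i' grad_i / h_i^2
    v_robust = a_inv @ meat @ a_inv
    return v_wnls3, v_robust
```

Setting `h_hat` to a vector of ones reproduces the unweighted NLS formulas, which is one simple way to check an implementation.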

When assumption (12.37) is violated, the asymptotic variance estimator of $\hat\theta$ must 
account for the asymptotic variance of $\hat\gamma$; we must estimate equation (12.41). We 
already know how to consistently estimate $A_o$: use expression (12.42) or (12.44) 
where $\hat\gamma$ is also plugged in. Estimation of $D_o$ is also straightforward. First, we need to 
estimate $F_o$. An estimator that is always available is the $P \times J$ matrix 

$$\hat F = N^{-1}\sum_{i=1}^{N}\nabla_\gamma s_i(\hat\theta;\hat\gamma). \tag{12.61}$$

In cases with conditioning variables, such as NLS, a simpler estimator can be obtained 
by computing $E[\nabla_\gamma s(w_i,\theta_o;\gamma^*)\,|\,x_i]$, replacing $(\theta_o,\gamma^*)$ with $(\hat\theta,\hat\gamma)$, and using 
this in place of $\nabla_\gamma s_i(\hat\theta;\hat\gamma)$. Next, replace $r_i(\gamma^*)$ with $\hat r_i \equiv r_i(\hat\gamma)$. Then 

$$\hat D \equiv N^{-1}\sum_{i=1}^{N}\hat g_i\hat g_i' \tag{12.62}$$

is consistent for $D_o$, where $\hat g_i = \hat s_i + \hat F\hat r_i$. The asymptotic variance of the two-step 
M-estimator can be obtained as in expression (12.48) or (12.49), but where $\hat s_i$ is replaced 
with $\hat g_i$. 

In some cases, a subtle issue arises in deciding whether adjustment for the presence 
of $\hat\gamma$ is needed. For example, in WNLS with a correctly specified conditional mean 
(and without any assumption about the conditional variance), $F_o = 0$, and so no 
adjustment to the asymptotic variance of $\hat\theta$ is needed. But if the mean is misspecified 
so that $\operatorname{plim}(\hat\theta) = \theta^*$, where $\theta^*$ minimizes the weighted mean squared error, then the 
corresponding matrix, say $F^* = E[\nabla_\gamma s_i(\theta^*;\gamma^*)]$, is not zero. Then, neither (12.59) nor 
(12.60) is valid. Further, it is not enough to replace the expected Hessian (under correct 
specification of the conditional mean) with the Hessian in (12.60). A version that 
allows the mean to be misspecified must account for the asymptotic variance of 
$\sqrt{N}(\hat\gamma - \gamma^*)$ (which would itself require recognition that the conditional mean is 
misspecified in the initial NLS estimation). For general two-step M-estimation, we 
typically require some feature of $D(y_i\,|\,x_i)$ to be correctly specified (such as the mean 
in the case of WNLS) in order to justify ignoring the sampling variation in $\hat\gamma$. In 
practice, calculating the variance for two-step methods that allows misspecification of 
the feature of $D(y_i\,|\,x_i)$ of interest is rarely done. 
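
For the case where assumption (12.37) fails, the adjusted variance built from (12.61) and (12.62) can be sketched as below. The sketch assumes the user supplies the estimated scores, the first-stage influence terms $\hat r_i$, an averaged Hessian estimate, and $\hat F$; the function name `two_step_avar` and this interface are hypothetical.

```python
import numpy as np

def two_step_avar(s_hat, r_hat, a_hat, f_hat):
    """Adjusted Avar(theta_hat) = A^{-1} D A^{-1} / N for a two-step M-estimator.

    s_hat: (N, P) estimated scores; r_hat: (N, J) first-stage terms r_i(gamma_hat);
    a_hat: (P, P) averaged Hessian estimate, as in (12.42) or (12.44);
    f_hat: (P, J) estimate of F_o, as in (12.61).
    """
    n_obs = s_hat.shape[0]
    g_hat = s_hat + r_hat @ f_hat.T          # row i is (s_i + F r_i)'
    d_hat = g_hat.T @ g_hat / n_obs          # eq. (12.62)
    a_inv = np.linalg.inv(a_hat)
    return a_inv @ d_hat @ a_inv / n_obs
```

Passing a zero matrix for `f_hat` returns the unadjusted sandwich estimator, so the two can be compared directly.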


12.6 Hypothesis Testing 


12.6.1 Wald Tests 


Wald tests are easily obtained once we choose a form of the asymptotic variance. To 
test the $Q$ restrictions $H_0: c(\theta_o) = 0$, we can form the Wald statistic 

$$W \equiv c(\hat\theta)'(\hat C\hat V\hat C')^{-1}c(\hat\theta), \tag{12.63}$$

where $\hat V$ is an asymptotic variance matrix estimator of $\hat\theta$, $\hat C \equiv C(\hat\theta)$, and $C(\theta)$ is the 
$Q \times P$ Jacobian of $c(\theta)$. The estimator $\hat V$ can be chosen to be fully robust, as in expression 
(12.48) or (12.49); under assumption (12.53), the simpler forms in Lemma 
12.2 are available. Also, $\hat V$ can be chosen to account for two-step estimation, when 
necessary. Provided $\hat V$ has been chosen appropriately, $W \overset{a}{\sim} \chi_Q^2$ under $H_0$. 

A couple of practical restrictions are needed for $W$ to have a limiting $\chi_Q^2$ distribution. 
First, $\theta_o$ must be in the interior of $\Theta$; that is, $\theta_o$ cannot be on the boundary. If, 
for example, the first element of $\theta$ must be nonnegative (and we impose this restriction 
in the estimation), then expression (12.63) does not have a limiting chi-square 
distribution under $H_0: \theta_{o1} = 0$. The second condition is that $C(\theta_o) = \nabla_\theta c(\theta_o)$ must 
have rank $Q$. This rules out cases where $\theta_o$ is unidentified under the null hypothesis, 
such as the NLS example where $m(x,\theta) = \theta_1 + \theta_2x_2 + \theta_3x_3^{\theta_4}$ and $\theta_{o3} = 0$ under $H_0$. 
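
A minimal sketch of the Wald statistic (12.63) follows, assuming the restriction values, their Jacobian, and a variance estimate have been computed beforehand. The name `wald_statistic` and its arguments are illustrative assumptions.

```python
import numpy as np

def wald_statistic(c_hat, c_jac, v_hat):
    """Wald statistic (12.63).

    c_hat: (Q,) restrictions evaluated at theta_hat;
    c_jac: (Q, P) Jacobian of c evaluated at theta_hat;
    v_hat: (P, P) estimate of Avar(theta_hat).
    """
    c_hat = np.atleast_1d(np.asarray(c_hat, dtype=float))
    c_jac = np.atleast_2d(np.asarray(c_jac, dtype=float))
    middle = c_jac @ v_hat @ c_jac.T         # Q x Q; must be nonsingular (rank condition)
    return float(c_hat @ np.linalg.solve(middle, c_hat))
```

The result would be compared with critical values of the $\chi_Q^2$ distribution, where $Q$ is the number of rows of the Jacobian.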

One drawback to the Wald statistic is that it is not invariant to how the nonlinear 
restrictions are imposed. We can change the outcome of a hypothesis test by redefining 
the constraint function, $c(\cdot)$. The easiest way to illustrate the lack of invariance 
of the Wald statistic is to use an asymptotic $t$ statistic. Just as in the classical linear 
model where the $F$ statistic for a single restriction is the square of the $t$ statistic for 
that same restriction, the Wald statistic is the square of the corresponding asymptotic 
$t$ statistic. Suppose that for a parameter $\theta_1 > 0$, the null hypothesis is $H_0: \theta_{o1} = 1$. 
The asymptotic $t$ statistic is $(\hat\theta_1 - 1)/\mathrm{se}(\hat\theta_1)$, where $\mathrm{se}(\hat\theta_1)$ is the asymptotic standard 
error of $\hat\theta_1$. Now define $\phi_1 = \log(\theta_1)$, so that $\phi_{o1} = \log(\theta_{o1})$ and $\hat\phi_1 = \log(\hat\theta_1)$. The 
null hypothesis can be stated as $H_0: \phi_{o1} = 0$. Using the delta method (see Chapter 3), 
$\mathrm{se}(\hat\phi_1) = \hat\theta_1^{-1}\mathrm{se}(\hat\theta_1)$, and so the $t$ statistic based on $\hat\phi_1$ is 
$\hat\phi_1/\mathrm{se}(\hat\phi_1) = \log(\hat\theta_1)\hat\theta_1/\mathrm{se}(\hat\theta_1) \ne (\hat\theta_1 - 1)/\mathrm{se}(\hat\theta_1)$. 
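
The two $t$ statistics above are easy to compare numerically. The short sketch below uses made-up values for the estimate and its standard error purely for illustration; the function name `level_and_log_t` is hypothetical.

```python
import math

def level_and_log_t(theta_hat, se_theta):
    """Return the t statistics for H0: theta = 1 and for H0: log(theta) = 0."""
    t_level = (theta_hat - 1.0) / se_theta
    se_phi = se_theta / theta_hat            # delta method: se(log theta_hat)
    t_log = math.log(theta_hat) / se_phi
    return t_level, t_log

# Illustrative numbers only: theta_hat = 1.5 with standard error 0.25.
t_level, t_log = level_and_log_t(1.5, 0.25)
print(t_level, t_log)
```

With these invented inputs the two statistics differ (2.0 versus roughly 2.43), even though both test the same hypothesis.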

The lack of invariance of the Wald statistic is discussed in more detail by Gregory 
and Veall (1985), Phillips and Park (1988), and Davidson and MacKinnon (1993, 
Sect. 13.6). The lack of invariance is a cause for concern because it suggests that the 
Wald statistic can have poor finite sample properties for testing nonlinear hypotheses. 
What is much less clear is that the lack of invariance has led empirical researchers to 
search over different statements of the null hypothesis in order to obtain a desired 
result. 


12.6.2 Score (or Lagrange Multiplier) Tests 


In cases where the unrestricted model is difficult to estimate but the restricted model 
is relatively simple to estimate, it is convenient to have a statistic that only requires 
estimation under the null. Such a statistic is Rao’s (1948) score statistic, also called 
the Lagrange multiplier statistic in econometrics, based on the work of Aitchison and 
Silvey (1958). We will focus on Rao’s original motivation for the statistic because it 
leads more directly to test statistics that are used in econometrics. An important point 
is that, even though Rao, Aitchison and Silvey, Engle (1984), and many others focused 
on the maximum likelihood setup, the score principle is applicable to any problem 
where the estimators solve a first-order condition, including the general class of M- 
estimators. 

The score approach is ideally suited for specification testing. Typically, the first 
step in specification testing is to begin with a popular model—one that is relatively 
easy to estimate and interpret—and nest it within a more complicated model. Then 
the popular model is tested against the more general alternative to determine if the 
original model is misspecified. We do not want to estimate the more complicated 
model unless there is significant evidence against the restricted form of the model. 
In stating the null and alternative hypotheses, there is no difference between specifi- 
cation testing and classical tests of parameter restrictions. However, in practice, 
specification testing gives primary importance to the restricted model, and we may 
have no intention of actually estimating the general model even if the null model is 
rejected. 

We will derive the score test only in the case where no correction is needed for 
preliminary estimation of nuisance parameters: either there are no such parameters 
present, or assumption (12.37) holds under $H_0$. If nuisance parameters are present, 
we do not explicitly show the score and Hessian depending on $\hat\gamma$. 

We again assume that there are $Q$ continuously differentiable restrictions imposed 
on $\theta_o$ under $H_0$, $c(\theta_o) = 0$. However, we must also assume that the restrictions define 
a mapping from $\mathbb{R}^{P-Q}$ to $\mathbb{R}^P$, say, $d: \mathbb{R}^{P-Q} \to \mathbb{R}^P$. In particular, under the null 
hypothesis, we can write $\theta_o = d(\lambda_o)$, where $\lambda_o$ is a $(P - Q) \times 1$ vector. We must assume 
that $\lambda_o$ is in the interior of its parameter space, $\Lambda$, under $H_0$. We also assume 
that $d$ is twice continuously differentiable on the interior of $\Lambda$. 

Let $\tilde\lambda$ be the solution to the constrained minimization problem 

$$\min_{\lambda \in \Lambda}\sum_{i=1}^{N} q[w_i, d(\lambda)]. \tag{12.64}$$

The constrained estimator of $\theta_o$ is simply $\tilde\theta \equiv d(\tilde\lambda)$. In practice, we do not have to 
explicitly find the function $d$; solving problem (12.64) is easily done just by directly 
imposing the restrictions, especially when the restrictions set certain parameters to 
hypothesized values (such as zero). Then, we just minimize the resulting objective 
function over the free parameters. 

As an example, consider the nonlinear regression model 

$$m(x,\theta) = \exp[x\beta + \delta_1(x\beta)^2 + \delta_2(x\beta)^3],$$

where $x$ is $1 \times K$ and contains unity as its first element. The null hypothesis is 
$H_0: \delta_1 = \delta_2 = 0$, so that the model with the restrictions imposed is just an exponential 
regression function, $m(x,\beta) = \exp(x\beta)$. 

The simplest method for deriving the LM test is to use Rao's score principle 
extended to the M-estimator case. The LM statistic is based on the limiting distribution 
of 

$$N^{-1/2}\sum_{i=1}^{N} s_i(\tilde\theta) \tag{12.65}$$

under $H_0$. This is the score with respect to the entire vector $\theta$, but we are evaluating it 
at the restricted estimates. If $\tilde\theta$ were replaced by $\hat\theta$, then expression (12.65) would be 
identically zero, which would make it useless as a test statistic. If the restrictions 
imposed by the null hypothesis are true, then expression (12.65) will not be statistically 
different from zero. 

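Assume initially that $\theta_o$ is in the interior of $\Theta$ under $H_0$; we will discuss how to
relax this assumption later. Now $\sqrt{N}(\tilde{\theta} - \theta_o) = O_p(1)$ by the delta method because
$\sqrt{N}(\tilde{\lambda} - \lambda_o) = O_p(1)$ under the given assumptions. A standard mean value expansion
yields

$$N^{-1/2} \sum_{i=1}^{N} s_i(\tilde{\theta}) = N^{-1/2} \sum_{i=1}^{N} s_i(\theta_o) + A_o \sqrt{N}(\tilde{\theta} - \theta_o) + o_p(1) \tag{12.66}$$

under $H_0$, where $A_o$ is given in expression (12.19). But $0 = \sqrt{N} c(\tilde{\theta}) = \sqrt{N} c(\theta_o) +
\ddot{C} \sqrt{N}(\tilde{\theta} - \theta_o)$, where $\ddot{C}$ is the $Q \times P$ Jacobian matrix $C(\theta)$ with rows evaluated at
mean values between $\tilde{\theta}$ and $\theta_o$. Under $H_0$, $c(\theta_o) = 0$, and plim $\ddot{C} = C(\theta_o) \equiv C_o$.
Therefore, under $H_0$, $C_o \sqrt{N}(\tilde{\theta} - \theta_o) = o_p(1)$, and so multiplying equation (12.66)
through by $C_o A_o^{-1}$ gives

$$C_o A_o^{-1} N^{-1/2} \sum_{i=1}^{N} s_i(\tilde{\theta}) = C_o A_o^{-1} N^{-1/2} \sum_{i=1}^{N} s_i(\theta_o) + o_p(1). \tag{12.67}$$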
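By the CLT, $C_o A_o^{-1} N^{-1/2} \sum_{i=1}^{N} s_i(\theta_o) \overset{d}{\to} \text{Normal}(0, C_o A_o^{-1} B_o A_o^{-1} C_o')$, where $B_o$ is
defined in expression (12.20). Under our assumptions, $C_o A_o^{-1} B_o A_o^{-1} C_o'$ has full rank
$Q$, and so

$$\left[N^{-1/2} \sum_{i=1}^{N} s_i(\tilde{\theta})\right]' A_o^{-1} C_o' \left[C_o A_o^{-1} B_o A_o^{-1} C_o'\right]^{-1} C_o A_o^{-1} \left[N^{-1/2} \sum_{i=1}^{N} s_i(\tilde{\theta})\right] \overset{d}{\to} \chi_Q^2.$$

The score or LM statistic is given by

$$LM \equiv \left(N^{-1/2} \sum_{i=1}^{N} \tilde{s}_i\right)' \tilde{A}^{-1} \tilde{C}' \left(\tilde{C} \tilde{A}^{-1} \tilde{B} \tilde{A}^{-1} \tilde{C}'\right)^{-1} \tilde{C} \tilde{A}^{-1} \left(N^{-1/2} \sum_{i=1}^{N} \tilde{s}_i\right), \tag{12.68}$$

where all quantities are evaluated at $\tilde{\theta}$. For example, $\tilde{C} \equiv C(\tilde{\theta})$, $\tilde{B}$ is given in
expression (12.45) but with $\tilde{\theta}$ in place of $\hat{\theta}$, and $\tilde{A}$ is one of the estimators in expression
(12.42) or (12.44), again evaluated at $\tilde{\theta}$. Under $H_0$, $LM \overset{a}{\sim} \chi_Q^2$. Because $\tilde{B}$ is at least
p.s.d., $LM \geq 0$.

For the Wald statistic we assumed that $\theta_o \in \text{int}(\Theta)$ under $H_0$; this assumption is
crucial for the statistic to have a limiting chi-square distribution. We will not consider
the Wald statistic when $\theta_o$ is on the boundary of $\Theta$ under $H_0$; see Wolak (1991) for
some results. The general derivation of the LM statistic also assumed that $\theta_o \in \text{int}(\Theta)$
under $H_0$. Nevertheless, for certain applications of the LM test we can drop the
requirement that $\theta_o$ is in the interior of $\Theta$ under $H_0$. A leading case occurs when $\theta$
can be partitioned as $\theta \equiv (\theta_1', \theta_2')'$, where $\theta_1$ is $(P - Q) \times 1$ and $\theta_2$ is $Q \times 1$. The null
hypothesis is $H_0 : \theta_{o2} = 0$, so that $c(\theta) \equiv \theta_2$. It is easy to see that the mean value
expansion used to derive the LM statistic is valid provided $\lambda_o \equiv \theta_{o1}$ is in the interior
of its parameter space under $H_0$; $\theta_o \equiv (\theta_{o1}', 0)'$ can be on the boundary of $\Theta$. This
observation is useful especially when testing hypotheses about parameters that must
be either nonnegative or nonpositive.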

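If we assume the generalized information matrix equality (12.53) with $\sigma_o^2 = 1$, the
LM statistic simplifies. The simplification results from the following reasoning: (1)
$\tilde{C}\tilde{D} = 0$ by the chain rule, where $\tilde{D} \equiv \nabla_\lambda d(\tilde{\lambda})$, since $c[d(\lambda)] \equiv 0$ for $\lambda$ in $\Lambda$. (2) If $E$ is
a $P \times Q$ matrix with rank $Q$, $F$ is a $P \times (P - Q)$ matrix with rank $P - Q$, and
$E'F = 0$, then $E(E'E)^{-1}E' = I_P - F(F'F)^{-1}F'$. (This is simply a statement about
projections onto orthogonal subspaces.) Choosing $E \equiv \tilde{A}^{-1/2}\tilde{C}'$ and $F \equiv \tilde{A}^{1/2}\tilde{D}$ gives
$\tilde{A}^{-1/2}\tilde{C}'(\tilde{C}\tilde{A}^{-1}\tilde{C}')^{-1}\tilde{C}\tilde{A}^{-1/2} = I_P - \tilde{A}^{1/2}\tilde{D}(\tilde{D}'\tilde{A}\tilde{D})^{-1}\tilde{D}'\tilde{A}^{1/2}$. Now, pre- and
postmultiply this equality by $\tilde{A}^{-1/2}$ to get $\tilde{A}^{-1}\tilde{C}'(\tilde{C}\tilde{A}^{-1}\tilde{C}')^{-1}\tilde{C}\tilde{A}^{-1} = \tilde{A}^{-1} - \tilde{D}(\tilde{D}'\tilde{A}\tilde{D})^{-1}\tilde{D}'$.
(3) Plug $\tilde{B} = \tilde{A}$ into expression (12.68) and use step 2, along with the first-order
condition $\tilde{D}'(\sum_{i=1}^{N} \tilde{s}_i) = 0$, to get

$$LM = \left(\sum_{i=1}^{N} \tilde{s}_i\right)' \tilde{M}^{-1} \left(\sum_{i=1}^{N} \tilde{s}_i\right), \tag{12.69}$$

where $\tilde{M}$ can be chosen as $\sum_{i=1}^{N} \tilde{A}_i$, $\sum_{i=1}^{N} \tilde{H}_i$, or $\sum_{i=1}^{N} \tilde{s}_i \tilde{s}_i'$. (Each of these expressions
consistently estimates $A_o = B_o$ when divided by $N$.) The last choice of $\tilde{M}$ results in a
statistic that is $N$ times the uncentered R-squared, say $R_0^2$, from the regression

$$1 \text{ on } \tilde{s}_i', \quad i = 1, 2, \ldots, N. \tag{12.70}$$

(Recall that $\tilde{s}_i'$ is a $1 \times P$ vector.) Because the dependent variable in regression (12.70)
is unity, $N R_0^2$ is equivalent to $N - SSR_0$, where $SSR_0$ is the sum of squared residuals
from regression (12.70). This is often called the outer product of the score LM statistic
because of the estimator it uses for $A_o$. While this statistic is simple to compute, there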
is ample evidence that it can have severe size distortions (typically, the null hypothe- 
sis is rejected much more often than the nominal size of the test). See, for example, 
Davidson and MacKinnon (1993), Bera and McKenzie (1986), Orme (1990), and 
Chesher and Spady (1991). 

The Hessian form of the LM statistic uses $\tilde{M} = \sum_{i=1}^{N} \tilde{H}_i$, and it has a few
drawbacks: (1) because $\tilde{M}$ is the Hessian for the full model evaluated at the restricted
estimates, it may not be positive definite, in which case the LM statistic can be
negative; (2) it requires computation of the second derivatives; and (3) it is not invariant to
reparameterizations. We will discuss the last problem later.

A statistic that always avoids the first problem, and often the second and third
problems, is based on $E[H(w, \theta_o) \mid x]$, assuming that $w$ partitions into endogenous
variables $y$ and exogenous variables $x$. We call the LM statistic that uses $\tilde{M} =
\sum_{i=1}^{N} \tilde{A}_i$ the expected Hessian form of the LM statistic. This name comes from the fact
that the statistic is based on the conditional expectation of $H(w, \theta_o)$ given $x$. When it
can be computed, the expected Hessian form is usually preferred because it tends to
have the best small sample properties.

The LM statistic in equation (12.69) is valid only when $B_o = A_o$, and therefore it is
not robust to failures of auxiliary assumptions in some important models. If $B_o \neq A_o$,
the limiting distribution of equation (12.69) is not chi-square and is not suitable for
testing.

In the context of NLS, the expected Hessian form of the LM statistic needs to be
modified for the presence of $\sigma_o^2$, assuming that Assumption NLS.3 holds under $H_0$.
Let $\tilde{\sigma}^2 \equiv N^{-1} \sum_{i=1}^{N} \tilde{u}_i^2$ be the estimate of $\sigma_o^2$ using the restricted estimator of $\theta_o$: $\tilde{u}_i \equiv
y_i - m(x_i, \tilde{\theta})$, $i = 1, 2, \ldots, N$. It is customary not to make a degrees-of-freedom
adjustment when estimating the variance using the null estimates, partly because the
sum of squared residuals for the restricted model is always at least as large as for the
unrestricted model. The score evaluated at the restricted estimates can be written as
$\tilde{s}_i = -\nabla_\theta \tilde{m}_i' \tilde{u}_i$ (the sign is irrelevant for the quadratic form). Thus the LM statistic
that imposes homoskedasticity is

$$LM = \left(\sum_{i=1}^{N} \nabla_\theta \tilde{m}_i' \tilde{u}_i\right)' \left(\sum_{i=1}^{N} \nabla_\theta \tilde{m}_i' \nabla_\theta \tilde{m}_i\right)^{-1} \left(\sum_{i=1}^{N} \nabla_\theta \tilde{m}_i' \tilde{u}_i\right) \Big/ \tilde{\sigma}^2. \tag{12.71}$$

A little algebra shows that this expression is identical to $N$ times the uncentered
R-squared, $R_u^2$, from the auxiliary regression

$$\tilde{u}_i \text{ on } \nabla_\theta \tilde{m}_i, \quad i = 1, 2, \ldots, N. \tag{12.72}$$

In other words, just regress the residuals from the restricted model on the gradient
with respect to the unrestricted mean function but evaluated at the restricted
estimates. Under $H_0$ and Assumption NLS.3, $LM = N R_u^2 \overset{a}{\sim} \chi_Q^2$.


In the nonlinear regression example with $m(x, \theta) = \exp[x\beta + \delta_1 (x\beta)^2 + \delta_2 (x\beta)^3]$,
let $\tilde{\beta}$ be the restricted NLS estimator with $\delta_1 = 0$ and $\delta_2 = 0$; in other words, $\tilde{\beta}$ is from
a nonlinear regression with an exponential regression function. The restricted
residuals are $\tilde{u}_i = y_i - \exp(x_i \tilde{\beta})$, and the gradient of $m(x, \theta)$ with respect to all parameters,
evaluated at the null, is

$$\nabla_\theta m(x_i, \beta_o, 0) = \{x_i \exp(x_i \beta_o), (x_i \beta_o)^2 \exp(x_i \beta_o), (x_i \beta_o)^3 \exp(x_i \beta_o)\}.$$

Plugging in $\tilde{\beta}$ gives $\nabla_\theta \tilde{m}_i = [x_i \tilde{m}_i, (x_i \tilde{\beta})^2 \tilde{m}_i, (x_i \tilde{\beta})^3 \tilde{m}_i]$, where $\tilde{m}_i \equiv \exp(x_i \tilde{\beta})$.
Regression (12.72) becomes

$$\tilde{u}_i \text{ on } x_i \tilde{m}_i, (x_i \tilde{\beta})^2 \tilde{m}_i, (x_i \tilde{\beta})^3 \tilde{m}_i, \quad i = 1, 2, \ldots, N. \tag{12.73}$$

Under $H_0$ and homoskedasticity, $N R_u^2 \overset{a}{\sim} \chi_2^2$, since there are two restrictions being
tested. This is a fairly simple way to test the exponential functional form without ever
estimating the more complicated alternative model. Other models that nest the
exponential model are discussed in Wooldridge (1992).
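
The auxiliary regression (12.72) is simple to carry out numerically. The following is a minimal sketch, assuming NumPy is available; the function names `lm_nr2` and `exp_model_gradient` are hypothetical and introduced only for illustration. The first computes $N$ times the uncentered R-squared from a regression of the restricted residuals on the gradient; the second builds the gradient in regression (12.73) for the exponential example, taking the restricted estimate as given.

```python
import numpy as np

def lm_nr2(u, grad):
    """N times the uncentered R-squared from regressing u on the columns of grad.

    u    : length-N array of restricted residuals (hypothetical input)
    grad : N x P array whose i-th row is the gradient evaluated at the
           restricted estimates (no intercept is added automatically)
    """
    u = np.asarray(u, dtype=float)
    grad = np.asarray(grad, dtype=float)
    # OLS coefficients from the auxiliary regression.
    coef, *_ = np.linalg.lstsq(grad, u, rcond=None)
    resid = u - grad @ coef
    # Uncentered R-squared: 1 - SSR / (uncentered total sum of squares).
    r2_uncentered = 1.0 - (resid @ resid) / (u @ u)
    return u.size * r2_uncentered

def exp_model_gradient(X, beta):
    """Rows [x_i m_i, (x_i b)^2 m_i, (x_i b)^3 m_i] with m_i = exp(x_i b)."""
    X = np.asarray(X, dtype=float)
    index = X @ np.asarray(beta, dtype=float)
    m = np.exp(index)
    return np.column_stack([X * m[:, None], index**2 * m, index**3 * m])
```

Given restricted residuals `u` and restricted estimates `beta`, the statistic would be `lm_nr2(u, exp_model_gradient(X, beta))`, to be compared with a chi-square critical value with two degrees of freedom.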

This example illustrates an important point: even though $\sum_{i=1}^{N} (x_i \tilde{m}_i)' \tilde{u}_i$ is
identically zero by the first-order condition for NLS, the term $x_i \tilde{m}_i$ must generally be
included in regression (12.73). The R-squared from the regression without $x_i \tilde{m}_i$ will
be different because the remaining regressors in regression (12.73) are usually
correlated with $x_i \tilde{m}_i$ in the sample. (More important, for $h = 2$ and 3, $(x_i \beta_o)^h \exp(x_i \beta_o)$ is
probably correlated with $x_i \exp(x_i \beta_o)$ in the population.) As a general rule, the entire
gradient $\nabla_\theta \tilde{m}_i$ must appear in the auxiliary regression.

In order to be robust against failure of Assumption NLS.3, the more general form
of the statistic in expression (12.68) should be used. Fortunately, this statistic also
can be easily computed for most hypotheses. Partition $\theta$ into the $(P - Q) \times 1$ vector
$\beta$ and the $Q \times 1$ vector $\delta$. Assume that the null hypothesis is $H_0 : \delta_o = \bar{\delta}$, where $\bar{\delta}$
is a prespecified vector (often containing all zeros, but not always). Let $\nabla_\beta \tilde{m}_i$
$[1 \times (P - Q)]$ and $\nabla_\delta \tilde{m}_i$ $(1 \times Q)$ denote the gradients with respect to $\beta$ and $\delta$,
respectively, evaluated at $\tilde{\beta}$ and $\bar{\delta}$. After tedious algebra, and using the special structure
$C(\theta) = [0 \mid I_Q]$, where 0 is a $Q \times (P - Q)$ matrix of zeros, the following procedure can
be shown to produce expression (12.68):

1. Run a multivariate regression

$$\nabla_\delta \tilde{m}_i \text{ on } \nabla_\beta \tilde{m}_i, \quad i = 1, 2, \ldots, N \tag{12.74}$$

and save the $1 \times Q$ vector residuals, say $\tilde{r}_i$. Then, for each $i$, form $\tilde{u}_i \tilde{r}_i$. (That is,
multiply $\tilde{u}_i$ by each element of $\tilde{r}_i$.)

2. $LM = N - SSR_0 = N R_0^2$ from the regression

$$1 \text{ on } \tilde{u}_i \tilde{r}_i, \quad i = 1, 2, \ldots, N \tag{12.75}$$

where $SSR_0$ is the usual sum of squared residuals and $R_0^2$ is the uncentered
R-squared. This step produces a statistic that has a limiting $\chi_Q^2$ distribution whether or
not Assumption NLS.3 holds. See Wooldridge (1991a) for more discussion.

We can illustrate the heteroskedasticity-robust test using the preceding exponential
model. Regression (12.74) is the same as regressing each of $(x_i \tilde{\beta})^2 \tilde{m}_i$ and $(x_i \tilde{\beta})^3 \tilde{m}_i$
onto $x_i \tilde{m}_i$, and saving the residuals $\tilde{r}_{i1}$ and $\tilde{r}_{i2}$, respectively ($N$ each). Then, regression
(12.75) is simply 1 on $\tilde{u}_i \tilde{r}_{i1}$, $\tilde{u}_i \tilde{r}_{i2}$. The number of regressors in the final regression of
the robust test is always the same as the degrees of freedom of the test.
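
The two-step procedure in regressions (12.74) and (12.75) can be sketched as follows. This is only an illustration under stated assumptions: NumPy is assumed available, the function name `robust_lm` is hypothetical, and the inputs (restricted residuals and the two blocks of the gradient, all evaluated at the restricted estimates) are taken as already computed.

```python
import numpy as np

def robust_lm(u, grad_beta, grad_delta):
    """Heteroskedasticity-robust LM statistic via regressions (12.74)-(12.75).

    u          : length-N restricted residuals
    grad_beta  : N x (P-Q) gradient with respect to the unrestricted block
    grad_delta : N x Q gradient with respect to the restricted block
    """
    u = np.asarray(u, dtype=float)
    gb = np.asarray(grad_beta, dtype=float).reshape(u.size, -1)
    gd = np.asarray(grad_delta, dtype=float).reshape(u.size, -1)
    # Step 1: regress each column of grad_delta on grad_beta; keep residuals r_i.
    coef, *_ = np.linalg.lstsq(gb, gd, rcond=None)
    r = gd - gb @ coef
    # Multiply the scalar residual u_i by each element of r_i.
    z = u[:, None] * r
    # Step 2: regress 1 on u_i * r_i; the statistic is N - SSR_0.
    ones = np.ones(u.size)
    b, *_ = np.linalg.lstsq(z, ones, rcond=None)
    ssr0 = np.sum((ones - z @ b) ** 2)
    return u.size - ssr0
```

For the exponential example, `grad_beta` would hold the rows $x_i \tilde{m}_i$ and `grad_delta` the two columns $(x_i \tilde{\beta})^2 \tilde{m}_i$ and $(x_i \tilde{\beta})^3 \tilde{m}_i$, so the final regression has two regressors.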

Finally, these procedures are easily modified for WNLS. Simply multiply both $\tilde{u}_i$
and $\nabla_\theta \tilde{m}_i$ by $1/\sqrt{\tilde{h}_i}$, where the variance estimates $\tilde{h}_i$ are based on the null model (so
we use a ~ rather than a ^). The nonrobust LM statistic that maintains Assumption
WNLS.3 is obtained as in regression (12.72). The robust form, which allows
$\text{Var}(y \mid x) \neq \sigma_o^2 h(x, \gamma_o)$, follows exactly as in regressions (12.74) and (12.75).

In examples like the previous one, there is an alternative, asymptotically equivalent 
method of obtaining a test statistic that, like the score test, only requires estimation 
under the null hypothesis. A variable addition test (VAT) is obtained by adding 
(estimated) terms to a standard model. Once the additional variables have been 
defined, calculation is typically straightforward with existing software, and robust
tests (robust to heteroskedasticity in the case of NLS) are easy to obtain.

In the previous example, where we focused on NLS, we again obtain $\tilde{\beta}$ from
NLS using the exponential regression function. We then compute $(x_i \tilde{\beta})^2$ and
$(x_i \tilde{\beta})^3$. Next, we estimate an expanded exponential mean function that includes the
original regressors, $x_i$, and the two additional regressors, say, $\tilde{z}_{i1} \equiv (x_i \tilde{\beta})^2$ and
$\tilde{z}_{i2} \equiv (x_i \tilde{\beta})^3$. That is, the auxiliary model (used for testing purposes only) is
$\exp(x_i \beta + \alpha_1 \tilde{z}_{i1} + \alpha_2 \tilde{z}_{i2})$. The VAT is obtained as a standard Wald test for joint
exclusion of $\tilde{z}_{i1}$ and $\tilde{z}_{i2}$ (and therefore has an asymptotic $\chi_2^2$ distribution). For the
nonrobust test, one difference between the LM test and the VAT is that the latter uses a
different variance estimate (because of the additional terms $\tilde{z}_{i1}$ and $\tilde{z}_{i2}$ that have
coefficients estimated rather than set to zero). Under the null, both estimators
converge to $\sigma_o^2$. It is easy to make VAT tests robust to heteroskedasticity whenever one is
using an econometrics package that computes heteroskedasticity-robust Wald tests of 
exclusion restrictions; typically, it simply requires adding a “robust” option to an 
estimation command such as NLS. 

For NLS (and WNLS), the VAT approach can be applied more generally when
the mean function has the form $m(x, \beta, \delta) = R(a(x\beta, x, \delta))$, where $a(k, x, 0) = k$
and $\partial a(k, x, 0)/\partial k = 1$. Then $m(x, \beta, 0) = R(x\beta)$ and $\nabla_\beta m(x, \beta, 0) = r(x\beta) \cdot x$, where
$r(\cdot)$ is the derivative of $R(\cdot)$. The vector of variables to be added is simply
$\tilde{z}_i \equiv \nabla_\delta a(x_i \tilde{\beta}, x_i, 0)$, a $1 \times Q$ vector, where $\tilde{\beta}$ is the NLS estimator using the mean
function $R(x_i \beta)$. In other words, we use the augmented regression function
$R(x_i \beta + \tilde{z}_i \alpha)$ in NLS and test joint exclusion of $\tilde{z}_i$. Note that this “model” is not
correctly specified under the alternative. We estimate the augmented equation to
obtain a simple test. The VAT procedure is easy to implement when the mean
function $R(\cdot)$ with linear functions inside is easily estimated. (Examples include the
exponential and logistic functions, and some others we encounter in Part IV.) We will
see how to obtain VAT tests with other estimation methods in future chapters.

The invariance issue for the score statistic is somewhat complicated, but several
results are known. First, it is easy to see that the outer product form of the statistic is
invariant to differentiable reparameterizations. Write $\phi = g(\theta)$ as a twice
continuously differentiable, invertible reparameterization; thus the $P \times P$ Jacobian of $g$,
$G(\theta)$, is nonsingular for all $\theta \in \Theta$. The objective function in terms of $\phi$ is $q^g(w, \phi)$,
and we must have $q^g[w, g(\theta)] = q(w, \theta)$ for all $\theta \in \Theta$. Differentiating and transposing
gives $s(w, \theta) = G(\theta)' s^g[w, g(\theta)]$, where $s^g(w, \phi)$ is the score of $q^g[w, \phi]$. If $\tilde{\phi}$ is the
restricted estimator of $\phi$, then $\tilde{\phi} = g(\tilde{\theta})$, and so, for each observation $i$, $\tilde{s}_i^g = (\tilde{G}')^{-1} \tilde{s}_i$.
Plugging this equation into the LM statistic in equation (12.69), with $\tilde{M}$ chosen as the
outer product form, shows that the statistic based on $\tilde{s}_i^g$ is identical to that based on $\tilde{s}_i$.

Score statistics based on the estimated Hessian are not generally invariant to
reparameterization because they can involve second derivatives of the function $g(\theta)$; see
Davidson and MacKinnon (1993, Sect. 13.6) for details. However, when $w$ partitions
as $(x, y)$, score statistics based on the expected Hessian (conditional on $x$), $A(x, \theta)$,
are often invariant. In Chapter 13 we will see that this is always the case for condi- 
tional maximum likelihood estimation. Invariance also holds for NLS and WNLS 
for both the usual and robust LM statistics because any reparameterization comes 
through the conditional mean. Predicted values and residuals are invariant to repar- 
ameterization, and the statistics obtained from regressions (12.72) and (12.75) only 
involve the residuals and first derivatives of the conditional mean function. As with 
the outer product LM statistic, the Jacobian in the first derivative cancels out. 


12.6.3 Tests Based on the Change in the Objective Function 


When both the restricted and unrestricted models are easy to estimate, a test based on 
the change in the objective function can greatly simplify the mechanics of obtaining 
a test statistic: we only need to obtain the value of the objective function with and 
without the restrictions imposed. However, the computational simplicity comes at 
a price in terms of robustness. Unlike the Wald and score tests, a test based on the 
change in the objective function cannot be made robust to general failure of as- 
sumption (12.53). Therefore, throughout this subsection we assume that the general- 
ized information matrix equality holds. Because the minimized objective function is 
invariant with respect to any reparameterization, the test statistic is invariant. 

In the context of two-step estimators, we must also assume that $\hat{\gamma}$ has no effect on
the asymptotic distribution of the M-estimator. That is, we maintain assumption
(12.37) when nuisance parameter estimates appear in the objective function (see
Problem 12.8).

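We first consider the case where $\sigma_o^2 = 1$, so that $B_o = A_o$. Using a second-order
Taylor expansion,

$$\sum_{i=1}^{N} q(w_i, \tilde{\theta}) - \sum_{i=1}^{N} q(w_i, \hat{\theta}) = \left[\sum_{i=1}^{N} s_i(\hat{\theta})\right]' (\tilde{\theta} - \hat{\theta}) + (1/2)(\tilde{\theta} - \hat{\theta})' \left(\sum_{i=1}^{N} \ddot{H}_i\right) (\tilde{\theta} - \hat{\theta}),$$

where $\ddot{H}_i$ is the $P \times P$ Hessian evaluated at mean values between $\tilde{\theta}$ and $\hat{\theta}$. Therefore,
under $H_0$ (using the first-order condition for $\hat{\theta}$), we have

$$2\left[\sum_{i=1}^{N} q(w_i, \tilde{\theta}) - \sum_{i=1}^{N} q(w_i, \hat{\theta})\right] = [\sqrt{N}(\tilde{\theta} - \hat{\theta})]' A_o [\sqrt{N}(\tilde{\theta} - \hat{\theta})] + o_p(1), \tag{12.76}$$

since $N^{-1} \sum_{i=1}^{N} \ddot{H}_i = A_o + o_p(1)$ and $\sqrt{N}(\tilde{\theta} - \hat{\theta}) = O_p(1)$. In fact, it follows from
equations (12.33) (without $\hat{\gamma}$) and (12.66) that $\sqrt{N}(\tilde{\theta} - \hat{\theta}) = A_o^{-1} N^{-1/2} \sum_{i=1}^{N} s_i(\tilde{\theta}) +
o_p(1)$. Plugging this equation into equation (12.76) shows that

$$QLR \equiv 2\left[\sum_{i=1}^{N} q(w_i, \tilde{\theta}) - \sum_{i=1}^{N} q(w_i, \hat{\theta})\right] = \left(N^{-1/2} \sum_{i=1}^{N} \tilde{s}_i\right)' A_o^{-1} \left(N^{-1/2} \sum_{i=1}^{N} \tilde{s}_i\right) + o_p(1), \tag{12.77}$$

so that QLR has the same limiting distribution, $\chi_Q^2$, as the LM statistic under $H_0$. (See
equation (12.69), remembering that plim$(\tilde{M}/N) = A_o$.) We call statistic (12.77) the
quasi-likelihood ratio (QLR) statistic, which comes from the fact that the leading
example of equation (12.77) is the likelihood ratio statistic in the context of maximum
likelihood estimation, as we will see in Chapter 13. We could also call equation
(12.77) a criterion function statistic, as it is based on the difference in the criterion or
objective function with and without the restrictions imposed.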

When nuisance parameters are present, the same estimate, say $\hat{\gamma}$, should be used in
obtaining the restricted and unrestricted estimates. This is to ensure that QLR is
nonnegative given any sample. Typically, $\hat{\gamma}$ would be based on initial estimation of
the unrestricted model.

If $\sigma_o^2 \neq 1$, we simply divide QLR by $\hat{\sigma}^2$, which is a consistent estimator of $\sigma_o^2$
obtained from the unrestricted estimation. For example, consider NLS under
Assumptions NLS.1–NLS.3. When equation (12.77) is divided by $\hat{\sigma}^2$ in equation
(12.57), we obtain $(SSR_r - SSR_{ur})/[SSR_{ur}/(N - P)]$, where $SSR_r$ and $SSR_{ur}$ are the
restricted and unrestricted sums of squared residuals. Sometimes an F version of this
statistic is used instead, which is obtained by dividing the chi-square version by $Q$:

$$F = \frac{(SSR_r - SSR_{ur})}{SSR_{ur}} \cdot \frac{(N - P)}{Q}. \tag{12.78}$$

This has exactly the same form as the F statistic from classical linear regression
analysis. Under the null hypothesis and homoskedasticity, $F$ can be treated as having
an approximate $F_{Q, N-P}$ distribution. (As always, this association is justified because
$Q \cdot F_{Q, N-P} \overset{a}{\sim} \chi_Q^2$ as $N - P \to \infty$.) Some authors (for example, Gallant, 1987) have
found that $F$ has better finite-sample properties than the chi-square version of the
statistic.
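
As a small numerical illustration of equation (12.78), the following sketch computes the F version from the two sums of squared residuals. The function name `qlr_f_statistic` and the numbers in the usage note are hypothetical and are chosen only to make the arithmetic easy to follow.

```python
def qlr_f_statistic(ssr_r, ssr_ur, n_obs, n_params, n_restrictions):
    """F form of the QLR statistic for NLS, following equation (12.78).

    ssr_r, ssr_ur  : restricted and unrestricted sums of squared residuals
    n_obs          : sample size N
    n_params       : number of parameters P in the unrestricted model
    n_restrictions : number of restrictions Q
    """
    if ssr_ur <= 0 or n_obs <= n_params or n_restrictions <= 0:
        raise ValueError("need SSR_ur > 0, N > P, and Q > 0")
    return (ssr_r - ssr_ur) / ssr_ur * (n_obs - n_params) / n_restrictions
```

With $SSR_r = 120$, $SSR_{ur} = 100$, $N = 104$, $P = 4$, and $Q = 2$, the function returns $(20/100)(100/2) = 10$, and the chi-square version is $Q \cdot F = 20$.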

For weighted NLS, the same statistic works under Assumption WNLS.3 provided
the residuals (both restricted and unrestricted) are weighted by $1/\sqrt{\hat{h}_i}$, where the $\hat{h}_i$
are obtained from estimation of the unrestricted model.


12.6.4 Behavior of the Statistics under Alternatives 


To keep the notation and assumptions as simple as possible, and to focus on the 
computation of valid test statistics under various assumptions, we have only derived 
the limiting distribution of the classical test statistics under the null hypothesis. It is 
also important to know how the tests behave under alternative hypotheses in order to 
choose a test with the highest power. 

All the tests we have discussed are consistent against the alternatives they are spe- 
cifically designed against. While test consistency is desirable, it tells us nothing about 
the likely finite-sample power that a statistic will have against particular alternatives. 
A framework that allows us to say more uses the notion of a sequence of local
alternatives. Specifying a local alternative is a device that can approximate the
finite-sample power of test statistics for alternatives “close” to $H_0$. If the null hypothesis is
$H_0 : c(\theta_o) = 0$, then a sequence of local alternatives is

$$H_1^N : c(\theta_{o,N}) = \delta_o/\sqrt{N}, \tag{12.79}$$

where $\delta_o$ is a given $Q \times 1$ vector. As $N \to \infty$, $H_1^N$ approaches $H_0$, since $\delta_o/\sqrt{N} \to 0$.
The division by $\sqrt{N}$ means that the alternatives are local: for given $N$, equation
(12.79) is an alternative to $H_0$, but as $N \to \infty$, the alternative gets closer to $H_0$.
Dividing $\delta_o$ by $\sqrt{N}$ ensures that each of the statistics has a well-defined limiting
distribution under the alternative that differs from the limiting distribution under $H_0$.

It can be shown that, under equation (12.79), the general forms of the Wald and
LM statistics have a limiting noncentral chi-square distribution with $Q$ degrees of
freedom under the regularity conditions used to obtain their null limiting
distributions. The noncentrality parameter depends on $A_o$, $B_o$, $C_o$, and $\delta_o$, and can be
estimated by using consistent estimators of $A_o$, $B_o$, and $C_o$. When we add assumption
(12.53), then the special versions of the Wald and LM statistics and the QLR
statistics have limiting noncentral chi-square distributions. For various $\delta_o$, we can estimate
what is known as the asymptotic local power of the test statistics by computing
probabilities from noncentral chi-square distributions.

Consider the Wald statistic where $B_o = A_o$. Denote by $\theta_o$ the limit of $\theta_{o,N}$ as
$N \to \infty$. The usual mean value expansion under $H_1^N$ gives

$$\sqrt{N} c(\hat{\theta}) = \delta_o + C(\theta_o) \sqrt{N}(\hat{\theta} - \theta_{o,N}) + o_p(1)$$

and, under standard assumptions, $\sqrt{N}(\hat{\theta} - \theta_{o,N}) \overset{a}{\sim} \text{Normal}(0, A_o^{-1})$.
Therefore, $\sqrt{N} c(\hat{\theta}) \overset{a}{\sim} \text{Normal}(\delta_o, C_o A_o^{-1} C_o')$ under the sequence (12.79). This result
implies that the Wald statistic has a limiting noncentral chi-square distribution with
$Q$ degrees of freedom and noncentrality parameter $\delta_o'(C_o A_o^{-1} C_o')^{-1} \delta_o$. This turns out
to be the same noncentrality parameter for the LM and QLR statistics when
$B_o = A_o$. The details are similar to those under $H_0$; see, for example, Gallant (1987,
Sect. 3.6).
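
To illustrate how asymptotic local power can be computed from a noncentral chi-square distribution, the following sketch uses the Poisson-mixture representation of that distribution. It is a simplified illustration under a loud assumption: the degrees of freedom $Q$ must be even, so that the central chi-square tail probability has a closed form and only the standard library is needed. The function names are hypothetical; for general $Q$ one would rely on a statistical library instead.

```python
import math

def central_chi2_sf_even(c, df):
    """P(chi-square with df degrees of freedom > c), valid for even df only."""
    if df <= 0 or df % 2 != 0:
        raise ValueError("df must be a positive even integer")
    half = c / 2.0
    term, total = 1.0, 0.0
    # Closed form: exp(-c/2) * sum_{i=0}^{df/2 - 1} (c/2)^i / i!
    for i in range(df // 2):
        if i > 0:
            term *= half / i
        total += term
    return math.exp(-half) * total

def local_power_even(crit, df, noncentrality, n_terms=200):
    """Approximate P(noncentral chi-square > crit) for even df.

    Uses the Poisson(noncentrality / 2) mixture of central chi-squares
    with df + 2j degrees of freedom, truncated after n_terms terms.
    """
    if noncentrality < 0:
        raise ValueError("noncentrality must be nonnegative")
    lam = noncentrality / 2.0
    weight = math.exp(-lam)  # Poisson probability of j = 0
    power = 0.0
    for j in range(n_terms):
        if j > 0:
            weight *= lam / j
        power += weight * central_chi2_sf_even(crit, df + 2 * j)
    return power
```

For $Q = 2$ the 5 percent critical value is $-2\log(.05)$, so `local_power_even(-2 * math.log(0.05), 2, ncp)` returns the size of the test when `ncp` is zero and increases toward one as the noncentrality parameter $\delta_o'(C_o A_o^{-1} C_o')^{-1} \delta_o$ grows.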

The statistic with the largest noncentrality parameter has the largest asymptotic 
local power. For choosing among the Wald, LM, and QLR statistics, this criterion 
does not help: they all have the same noncentrality parameters under the local alter- 
natives (12.79) and assumption (12.53). Without assumption (12.53), the robust Wald 
and LM statistics have the same noncentrality parameter. 

The notion of local alternatives is useful when choosing among statistics based on 
different estimators. Not surprisingly, the more efficient estimator produces tests with 
the best asymptotic local power under standard assumptions. But we should keep in 
mind the efficiency versus robustness trade-off, especially when efficient test statistics 
are computed under tenuous assumptions. 

General analyses under local alternatives are available in Gallant (1987), Gallant 
and White (1988), and White (1994). See Andrews (1989) for innovative suggestions 
for using local power analysis in applied work. 


12.7 Optimization Methods 


In this section we briefly discuss three iterative schemes that can be used to solve the
general minimization problem (12.8) or (12.31). In the latter case, the minimization
is only over $\theta$, so the presence of $\hat{\gamma}$ changes nothing. If $\hat{\gamma}$ is present, the score and
Hessian with respect to $\theta$ are simply evaluated at $\hat{\gamma}$. These methods are closely related
to the asymptotic variance matrix estimators and test statistics we discussed in
Sections 12.5 and 12.6.


12.7.1 Newton-Raphson Method 


Iterative methods are defined by an algorithm for going from one iteration to the
next. Let $\theta^{\{g\}}$ be the $P \times 1$ vector on the $g$th iteration, and let $\theta^{\{g+1\}}$ be the value on
the next iteration. To motivate how we get from $\theta^{\{g\}}$ to $\theta^{\{g+1\}}$, use a mean value
expansion (row by row) to write

$$\sum_{i=1}^{N} s_i(\theta^{\{g+1\}}) = \sum_{i=1}^{N} s_i(\theta^{\{g\}}) + \left[\sum_{i=1}^{N} H_i(\theta^{\{g\}})\right] (\theta^{\{g+1\}} - \theta^{\{g\}}) + r^{\{g\}}, \tag{12.80}$$

where $s_i(\theta)$ is the $P \times 1$ score with respect to $\theta$, evaluated at observation $i$, $H_i(\theta)$ is
the $P \times P$ Hessian, and $r^{\{g\}}$ is a $P \times 1$ vector of remainder terms. We are trying to
find the solution $\hat{\theta}$ to equation (12.14). If $\theta^{\{g+1\}} = \hat{\theta}$, then the left-hand side of
equation (12.80) is zero. Setting the left-hand side to zero, ignoring $r^{\{g\}}$, and assuming that
the Hessian evaluated at $\theta^{\{g\}}$ is nonsingular, we can write

$$\theta^{\{g+1\}} = \theta^{\{g\}} - \left[\sum_{i=1}^{N} H_i(\theta^{\{g\}})\right]^{-1} \left[\sum_{i=1}^{N} s_i(\theta^{\{g\}})\right]. \tag{12.81}$$

Equation (12.81) provides an iterative method for finding $\hat{\theta}$. To begin the iterations
we must choose a vector of starting values; call this vector $\theta^{\{0\}}$. Good starting values
are often difficult to come by, and sometimes we must experiment with several
choices before the problem converges. Ideally, the iterations wind up at the same
place regardless of the starting values, but this outcome is not guaranteed. Given the
starting values, we plug $\theta^{\{0\}}$ into the right-hand side of equation (12.81) to get $\theta^{\{1\}}$.
Then, we plug $\theta^{\{1\}}$ into equation (12.81) to get $\theta^{\{2\}}$, and so on.

If the iterations are proceeding toward the minimum, the increments $\theta^{\{g+1\}} - \theta^{\{g\}}$
will eventually become very small: as we near the solution, $\sum_{i=1}^{N} s_i(\theta^{\{g\}})$ gets close to
zero. Some use as a stopping rule the requirement that the largest absolute change
$|\theta_j^{\{g+1\}} - \theta_j^{\{g\}}|$, for $j = 1, 2, \ldots, P$, be smaller than some small constant; others prefer
to look at the largest percentage change in the parameter values.

Another popular stopping rule is based on the quadratic form

$$\left[\sum_{i=1}^{N} s_i(\theta^{\{g\}})\right]' \left[\sum_{i=1}^{N} H_i(\theta^{\{g\}})\right]^{-1} \left[\sum_{i=1}^{N} s_i(\theta^{\{g\}})\right], \tag{12.82}$$

where the iterations stop when expression (12.82) is less than some suitably small
number, say .0001.

The iterative scheme just outlined is usually called the Newton-Raphson method. 
It is known to work in a variety of circumstances. Our motivation here has been 
heuristic, and we will not investigate situations under which the Newton-Raphson 
method does not work well. (See, for example, Quandt, 1983, for some theoretical 
results.) The Newton-Raphson method has some drawbacks. First, it requires com- 
puting the second derivatives of the objective function at every iteration. These cal- 
culations are not very taxing if closed forms for the second partials are available, but 
in many cases they are not. A second problem is that, as we saw for the case of 
nonlinear least squares, the Hessian evaluated at a particular value of $\theta$ may not be
positive definite. If the inverted Hessian in expression (12.81) is not positive definite, 
the procedure may head in the wrong direction. 

We should always check that progress is being made from one iteration to the next 
by computing the difference in the values of the objective function from one iteration 
to the next: 


$$\sum_{i=1}^{N} q_i(\theta^{\{g+1\}}) - \sum_{i=1}^{N} q_i(\theta^{\{g\}}). \tag{12.83}$$


Because we are minimizing the objective function, we should not take the step from g 
to g + 1 unless expression (12.83) is negative. (If we are maximizing the function, the 
iterations in equation (12.81) can still be used because the expansion in equation 
(12.80) is still appropriate, but then we want expression (12.83) to be positive.) 

A slight modification of the Newton-Raphson method is sometimes useful to speed
up convergence: multiply the Hessian term in expression (12.81) by a positive
number, say $r$, known as the step size. Sometimes the step size $r = 1$ produces too large a
change in the parameters. If the objective function does not decrease using $r = 1$,
then try, say, $r = .5$. Again, check the value of the objective function. If it has now
decreased, go on to the next iteration (where $r = 1$ is usually used at the beginning of
each iteration); if the objective function still has not decreased, replace $r$ with, say, .25.
Continue halving $r$ until the objective function decreases. If you have not succeeded
in decreasing the objective function after several choices of $r$, new starting values
might be needed. Or, a different optimization method might be needed.
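
The Newton-Raphson iteration (12.81), the stopping rule (12.82), the progress check (12.83), and the step-halving modification can be combined in a short routine. The following is a minimal sketch, assuming NumPy is available; the function name `newton_raphson` and its arguments are hypothetical, and the user supplies callables returning the summed objective, score, and Hessian. As a defensive choice that is not part of the description above, the sketch takes the absolute value of the quadratic form, because the Hessian need not be positive definite away from the solution.

```python
import numpy as np

def newton_raphson(objective, score, hessian, theta0, tol=1e-4, max_iter=100):
    """Minimize a sum of objective functions by Newton-Raphson with step halving.

    objective(theta) -> scalar, sum of q_i(theta)
    score(theta)     -> length-P array, sum of s_i(theta)
    hessian(theta)   -> P x P array, sum of H_i(theta)
    """
    theta = np.asarray(theta0, dtype=float)
    for _ in range(max_iter):
        s = score(theta)
        H = hessian(theta)
        direction = np.linalg.solve(H, s)   # [sum H_i]^{-1} [sum s_i]
        # Stopping rule based on the quadratic form in (12.82).
        if abs(s @ direction) < tol:
            return theta
        q_old = objective(theta)
        r = 1.0                             # start each iteration with r = 1
        # Take the step only if the objective decreases, as in (12.83).
        while objective(theta - r * direction) >= q_old:
            r /= 2.0
            if r < 1e-8:
                raise RuntimeError("no decrease found; try new starting values")
        theta = theta - r * direction
    raise RuntimeError("maximum number of iterations reached")
```

For a quadratic objective the routine reaches the minimizer in a single step with $r = 1$; for other objectives the number of iterations depends on the starting values.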


12.7.2 Berndt, Hall, Hall, and Hausman Algorithm 


In the context of maximum likelihood estimation, Berndt, Hall, Hall, and Hausman
(1974) (hereafter, BHHH) proposed using the outer product of the score in place of
the Hessian. This method can be applied in the general M-estimation case (even
though the information matrix equality (12.53) that motivates the method need not
hold). The BHHH iteration for a minimization problem is

$$\theta^{\{g+1\}} = \theta^{\{g\}} - r \left[\sum_{i=1}^{N} s_i(\theta^{\{g\}}) s_i(\theta^{\{g\}})'\right]^{-1} \sum_{i=1}^{N} s_i(\theta^{\{g\}}), \tag{12.84}$$

where $r$ is the step size. (If we want to maximize $\sum_{i=1}^{N} q(w_i, \theta)$, the minus sign in
equation (12.84) should be replaced with a plus sign.) The term multiplying $r$,
sometimes called the direction for the next iteration, can be obtained as the $P \times 1$ OLS
coefficients from the regression

$$1 \text{ on } s_i(\theta^{\{g\}})', \quad i = 1, 2, \ldots, N. \tag{12.85}$$


The BHHH procedure is easy to implement because it requires computation of the 
score only; second derivatives are not needed. Further, because the sum of the outer 
product of the scores is always at least p.s.d., it does not suffer from the potential 
nonpositive definiteness of the Hessian. 

A convenient stopping rule for the BHHH method is obtained as in expression
(12.82), but with the sum of the outer products of the score replacing the sum of the
Hessians. This is identical to $N$ times the uncentered R-squared from regression
(12.85). Interestingly, this is the same regression used to obtain the outer product of
the score form of the LM statistic when $B_o = A_o$, a feature that suggests a natural
method for estimating a complicated model after a simpler version of the model has
been estimated. Set the starting value, $\theta^{\{0\}}$, equal to the vector of restricted estimates,
$\tilde{\theta}$. Then $N R_0^2$ from the regression used to obtain the first iteration can be used to test
the restricted model against the more general model to be estimated; if the restrictions
are not rejected, we could just stop the iterations. Of course, as we discussed in
Section 12.6.2, the outer-product form of the LM statistic is often ill-behaved even with
fairly large sample sizes.
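
Regression (12.85) delivers both the BHHH direction and the stopping-rule statistic in one pass. The sketch below is illustrative only, assuming NumPy is available; the function name `bhhh_direction` is hypothetical, and the input is an $N \times P$ array whose $i$th row is $s_i(\theta^{\{g\}})'$.

```python
import numpy as np

def bhhh_direction(scores):
    """OLS of 1 on the score rows: returns (direction, N * uncentered R-squared)."""
    S = np.asarray(scores, dtype=float)
    ones = np.ones(S.shape[0])
    # The OLS coefficients equal [sum s_i s_i']^{-1} sum s_i.
    direction, *_ = np.linalg.lstsq(S, ones, rcond=None)
    resid = ones - S @ direction
    # Because the dependent variable is unity, N * R0^2 = N - SSR0.
    n_r2 = S.shape[0] - resid @ resid
    return direction, n_r2
```

For a minimization problem the update would then be `theta_next = theta - r * direction`, as in equation (12.84), with iterations stopping once the returned statistic is suitably small.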


12.7.3 Generalized Gauss-Newton Method 


The final iteration scheme we cover is closely related to the estimator of the expected
value of the Hessian in expression (12.44). Let $A(x, \theta_o)$ be the expected value of
$H(w, \theta_o)$ conditional on $x$, where $w$ is partitioned into $y$ and $x$. Then the generalized
Gauss-Newton method uses the updating equation

$$\theta^{\{g+1\}} = \theta^{\{g\}} - r \left[\sum_{i=1}^{N} A_i(\theta^{\{g\}})\right]^{-1} \sum_{i=1}^{N} s_i(\theta^{\{g\}}), \tag{12.86}$$

where $\theta^{\{g\}}$ replaces $\theta_o$ in $A(x_i, \theta_o)$. (As before, $A_i$ and $s_i$ might also depend on $\hat{\gamma}$.)
This scheme works well when $A(x, \theta_o)$ can be obtained in closed form.

In the special case of nonlinear least squares, we obtain what is traditionally
called the Gauss-Newton method (for example, Quandt, 1983). Because $s_i(\theta) =
-\nabla_\theta m_i(\theta)'[y_i - m_i(\theta)]$, the iteration step is

$$\theta^{\{g+1\}} = \theta^{\{g\}} + r \left(\sum_{i=1}^{N} \nabla_\theta m_i^{\{g\}\prime} \nabla_\theta m_i^{\{g\}}\right)^{-1} \left(\sum_{i=1}^{N} \nabla_\theta m_i^{\{g\}\prime} u_i^{\{g\}}\right).$$

The term multiplying the step size $r$ is obtained as the OLS coefficients of the
regression of the residuals on the gradient, both evaluated at $\theta^{\{g\}}$. The stopping rule
can be based on $N$ times the uncentered R-squared from this regression. Note how
closely the Gauss-Newton method of optimization is related to the regression used to
obtain the nonrobust LM statistic (see regression (12.72)).
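
As an illustration of the Gauss-Newton step for NLS, the following sketch fits an exponential regression function $\exp(x\beta)$. It assumes NumPy is available; the function name `gauss_newton_exp`, the tolerance, and the step-halving safeguard are hypothetical choices made for this example. Each iteration regresses the current residuals on the current gradient and stops when $N$ times the uncentered R-squared from that regression is small.

```python
import numpy as np

def gauss_newton_exp(y, X, beta0, tol=1e-8, max_iter=100):
    """NLS for m(x, b) = exp(x b) by Gauss-Newton with step halving."""
    y = np.asarray(y, dtype=float)
    X = np.asarray(X, dtype=float)
    beta = np.asarray(beta0, dtype=float)
    n = y.size
    for _ in range(max_iter):
        m = np.exp(X @ beta)
        u = y - m                      # residuals at the current iterate
        grad = X * m[:, None]          # gradient of exp(x b) with respect to b
        # OLS of residuals on the gradient gives the direction.
        step, *_ = np.linalg.lstsq(grad, u, rcond=None)
        fitted = grad @ step
        ssr = u @ u
        # N * uncentered R-squared = N * (explained SS) / (total SS).
        if ssr == 0.0 or n * (fitted @ fitted) / ssr < tol:
            return beta
        r = 1.0
        while np.sum((y - np.exp(X @ (beta + r * step))) ** 2) >= ssr:
            r /= 2.0
            if r < 1e-10:
                raise RuntimeError("no decrease found; try new starting values")
        beta = beta + r * step
    raise RuntimeError("maximum number of iterations reached")
```

At the returned estimate the residuals are approximately orthogonal to the gradient, which is the NLS first-order condition. Note that with data that fit the model exactly the R-squared stopping rule is uninformative, since the residuals then lie entirely in the span of the gradient until they vanish.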


12.7.4 Concentrating Parameters out of the Objective Function 


In some cases, it is computationally convenient to concentrate one set of parameters
out of the objective function. Partition $\theta$ into the vectors $\beta$ and $\gamma$. Then the first-order
conditions that define $\hat{\theta}$ are

$$\sum_{i=1}^{N} \nabla_\beta q(w_i, \beta, \gamma) = 0, \qquad \sum_{i=1}^{N} \nabla_\gamma q(w_i, \beta, \gamma) = 0. \tag{12.87}$$

Rather than solving these for $\hat{\beta}$ and $\hat{\gamma}$, suppose that the second set of equations can be
solved for $\gamma$ as a function of $W \equiv (w_1, w_2, \ldots, w_N)$ and $\beta$ for any outcomes $W$ and
any $\beta$ in the parameter space: $\gamma = g(W, \beta)$. Then, by construction,

$$\sum_{i=1}^{N} \nabla_\gamma q[w_i, \beta, g(W, \beta)] = 0. \tag{12.88}$$

When we plug $g(W, \beta)$ into the original objective function, we obtain the
concentrated objective function,

$$Q^c(W, \beta) = \sum_{i=1}^{N} q[w_i, \beta, g(W, \beta)]. \tag{12.89}$$

Under standard differentiability assumptions, the minimizer of equation (12.89) is
identical to the $\hat{\beta}$ that solves equations (12.87) (along with $\hat{\gamma}$), as can be seen by
differentiating equation (12.89) with respect to $\beta$ using the chain rule, setting the result
to zero, and using equation (12.88); then $\hat{\gamma}$ can be obtained as $g(W, \hat{\beta})$.

As a device for studying asymptotic properties, the concentrated objective function
is of limited value because $g(W, \beta)$ generally depends on all of $W$, in which case
the objective function cannot be written as the sum of independent, identically
distributed summands. One setting where equation (12.89) is a sum of i.i.d. functions
occurs when we concentrate out individual-specific effects from certain nonlinear 
panel data models. In addition, the concentrated objective function can be useful for 
establishing the equivalence of seemingly different estimation approaches. 
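
To make the idea of concentrating out parameters concrete, here is a small sketch for a least squares objective $q(w_i, \beta, \gamma) = [y_i - \gamma\exp(\beta x_i)]^2$, where $\gamma$ enters linearly and can be solved for in closed form given $\beta$. The model, the function names, the simulated data, and the grid-plus-golden-section search are assumptions chosen for illustration only.

```python
import numpy as np


def gamma_given_beta(beta, x, y):
    """g(W, beta): solves the gamma first-order condition for a given beta."""
    e = np.exp(beta * x)
    return (e @ y) / (e @ e)


def concentrated_ssr(beta, x, y):
    """Objective function after plugging g(W, beta) in for gamma."""
    resid = y - gamma_given_beta(beta, x, y) * np.exp(beta * x)
    return resid @ resid


def golden_section(f, a, b, tol=1e-9):
    """Minimize a unimodal scalar function on [a, b]."""
    invphi = (np.sqrt(5.0) - 1.0) / 2.0
    c = b - invphi * (b - a)
    d = a + invphi * (b - a)
    fc, fd = f(c), f(d)
    while b - a > tol:
        if fc < fd:          # minimum lies in [a, d]
            b, d, fd = d, c, fc
            c = b - invphi * (b - a)
            fc = f(c)
        else:                # minimum lies in [c, b]
            a, c, fc = c, d, fd
            d = a + invphi * (b - a)
            fd = f(d)
    return 0.5 * (a + b)


# Illustrative data (assumed design): true (beta, gamma) = (0.7, 2.0)
rng = np.random.default_rng(1)
x = rng.uniform(0.0, 2.0, size=200)
y = 2.0 * np.exp(0.7 * x) + 0.2 * rng.standard_normal(200)

# Coarse grid to bracket the minimizer, then refine within the bracket
grid = np.linspace(-1.0, 2.0, 61)
k = int(np.argmin([concentrated_ssr(b, x, y) for b in grid]))
lo, hi = grid[max(k - 1, 0)], grid[min(k + 1, grid.size - 1)]
beta_hat = golden_section(lambda b: concentrated_ssr(b, x, y), lo, hi)
gamma_hat = gamma_given_beta(beta_hat, x, y)
print(beta_hat, gamma_hat)
```

The search is over the scalar $\beta$ only; $\hat{\gamma}$ is recovered at the end as $g(W, \hat{\beta})$, and the pair should satisfy both sets of first-order conditions up to numerical error.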


12.8 Simulation and Resampling Methods 


So far we have focused on the asymptotic properties of M-estimators, as these pro- 
vide a unified framework for inference. But there are a few good reasons to go be- 
yond asymptotic results, at least in some cases. First, the asymptotic approximations 
need not be very good, especially with small sample sizes, highly nonlinear models, 
or unusual features of the population distribution of $w_i$. Simulation methods, while
always specific to the design studied, can help determine how well the asymptotic approximations work.
Resampling methods can allow us to improve on the asymptotic distribution 
approximations. 

Even if we feel comfortable with asymptotic approximations to the distribution of
$\hat{\theta}$, we may not be as confident in the approximations for estimating a nonlinear
function of the parameters, say $\gamma_o = g(\theta_o)$. Under the assumptions in Section 3.5.2,
we can use the delta method to approximate the variance of $\hat{\gamma} = g(\hat{\theta})$. Depending on
the nature of $g(\cdot)$, applying the delta method might be difficult, and it might not
result in a very good approximation. Resampling methods can simplify the calculation
of standard errors, confidence intervals, and p-values for test statistics, and we can 
get a good idea of the amount of finite-sample bias in the estimation method. In ad- 
dition, under certain assumptions and for certain statistics, resampling methods can 
provide quantifiable improvements to the usual asymptotics. 


12.8.1 Monte Carlo Simulation 


In a Monte Carlo simulation, we attempt to estimate the mean and variance
(assuming that these exist) and possibly other features of the distribution of the
M-estimator, $\hat{\theta}$. The idea is usually to determine how much bias $\hat{\theta}$ has for estimating $\theta_o$,
or to determine the efficiency of $\hat{\theta}$ compared with other estimators of $\theta_o$. In addition,
we often want to know how well the asymptotic standard errors approximate the
standard deviations of the $\hat{\theta}_j$.

To conduct a simulation, we must choose a population distribution for $w$, which
depends on the finite-dimensional vector $\theta_o$. We must set the values of $\theta_o$ and decide
on a sample size, $N$. We then draw a random sample of size $N$ from this distribution
and use the sample to obtain an estimate of $\theta_o$. We draw a new random sample
and compute another estimate of $\theta_o$. We repeat the process for several iterations, say
$M$. Let $\hat{\theta}^{(m)}$ be the estimate of $\theta_o$ based on the $m$th iteration. Given $\{\hat{\theta}^{(m)} : m =
1, 2, \ldots, M\}$, we can compute the sample average and sample variance to estimate
$E(\hat{\theta})$ and $\mathrm{Var}(\hat{\theta})$, respectively. We might also form $t$ statistics or other test statistics to
see how well the asymptotic distributions approximate the finite-sample distributions.
We can also see how well asymptotic confidence intervals cover the population
parameter relative to the nominal confidence level.
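
The steps above can be sketched in a few lines of Python. The example below is only an illustration under assumed choices: a linear model estimated by OLS, standard normal regressors and errors, $N = 50$, and $M = 2{,}000$ replications. It reports the simulated bias and variance of the slope estimator, the average asymptotic standard error, and the coverage rate of the nominal 95 percent interval.

```python
import numpy as np


def monte_carlo_ols_slope(theta_o, n, n_reps, seed=0):
    """Monte Carlo study of the OLS slope in y = theta_o[0] + theta_o[1] * x + u."""
    rng = np.random.default_rng(seed)
    est = np.empty(n_reps)
    se = np.empty(n_reps)
    for m in range(n_reps):
        # Draw a fresh random sample of size n from the assumed population
        x = rng.standard_normal(n)
        u = rng.standard_normal(n)
        y = theta_o[0] + theta_o[1] * x + u
        X = np.column_stack([np.ones(n), x])
        xtx_inv = np.linalg.inv(X.T @ X)
        b = xtx_inv @ (X.T @ y)
        resid = y - X @ b
        sigma2 = resid @ resid / (n - 2)
        est[m] = b[1]
        se[m] = np.sqrt(sigma2 * xtx_inv[1, 1])   # usual OLS standard error
    covered = np.abs(est - theta_o[1]) <= 1.96 * se
    return {
        "bias": est.mean() - theta_o[1],
        "variance": est.var(ddof=1),
        "mean_se": se.mean(),
        "coverage": covered.mean(),
    }


results = monte_carlo_ols_slope(theta_o=(1.0, 0.5), n=50, n_reps=2000)
print(results)
```

Comparing `mean_se` with the square root of `variance` shows how well the asymptotic standard error tracks the simulated standard deviation for this particular design.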

A good Monte Carlo study varies the value of $\theta_o$, the sample size, and even the
general form of the distribution of $w$. In fact, it is generally a good idea to check how
an estimation method fares when the assumptions on which it is based partially or
completely fail. Obtaining a thorough study can be very challenging, especially for a
complicated, nonlinear model. First, to get good estimates of the distribution of $\hat{\theta}$, we
would like $M$ to be large (perhaps several thousand). But for each Monte Carlo
iteration, we must obtain $\hat{\theta}^{(m)}$, and this step can be computationally expensive because
it often requires the iterative methods we discussed in Section 12.7. Repeating the
simulations for many different sample sizes $N$, values of $\theta_o$, and distributional shapes
can be very time-consuming.

In most economic applications, $w_i$ is partitioned as $(x_i, y_i)$. While we can draw the
full vector $w_i$ randomly in the Monte Carlo iterations, sometimes the $x_i$ are fixed at
the beginning of the iterations, and then $y_i$ is drawn from the conditional distribution
given $x_i$. This method simplifies the simulations because we do not need to vary the
distribution of $x_i$ along with the distribution of interest, the distribution of $y_i$ given $x_i$.
If we fix the $x_i$ at the beginning of the simulations, the distributional features of $\hat{\theta}$ that
we estimate from the Monte Carlo simulations are conditional on $\{x_1, x_2, \ldots, x_N\}$.
This conditional approach is especially common in linear and nonlinear regression
contexts, as well as conditional maximum likelihood.

It is important not to rely too much on Monte Carlo simulations. Many estimation 
methods, including OLS, IV, and panel data estimators, have asymptotic properties 
that do not depend on underlying distributions. In the nonlinear regression model, 
the NLS estimator is $\sqrt{N}$-asymptotically normal, and the usual asymptotic variance
matrix (12.58) is valid under Assumptions NLS.1–NLS.3. However, in a typical
Monte Carlo simulation, the implied error, $u$, is assumed to be independent of $x$, and
the distribution of $u$ must be specified. The Monte Carlo results then pertain to this
distribution, and it can be misleading to extrapolate to different settings. In addition,
we can never try more than just a small part of the parameter space. Because we
never know the population value $\theta_o$, we can never be sure how well our Monte Carlo
study describes the underlying population. Hendry (1984) discusses how response
surface analysis can be used to reduce the specificity of Monte Carlo studies. See also 
Davidson and MacKinnon (1993, Chap. 21). 


12.8.2 Bootstrapping 


A Monte Carlo simulation, although it is informative about how well the asymptotic 
approximations can be expected to work in specific situations, does not generally help 
us refine our inference given a particular sample. (Because we do not know $\theta_o$, we
cannot know whether our Monte Carlo findings apply to the population we are 
studying. Nevertheless, researchers sometimes use the results of a Monte Carlo sim- 
ulation to obtain rules of thumb for adjusting standard errors or for adjusting critical 
values for test statistics.) The method of bootstrapping, which is a popular resampling 
method, can be used as an alternative to asymptotic approximations for obtaining 
standard errors, confidence intervals, and p-values for test statistics. 

Though there are several variants of the bootstrap, we begin with one that can
be applied to general M-estimation. The goal is to approximate the distribution of
$\hat{\theta}$ without relying on the usual first-order asymptotic theory. Let $\{w_1, w_2, \ldots, w_N\}$
denote the outcome of the random sample used to obtain the estimate. The
nonparametric bootstrap is essentially a Monte Carlo simulation where the observed
sample is treated as the population. In other words, at each bootstrap iteration, $b$, a
random sample of size $N$ is drawn from $\{w_1, w_2, \ldots, w_N\}$. (That is, we sample with
replacement.) In practice, we use a random number generator to obtain $N$ integers
from the set $\{1, 2, \ldots, N\}$; in the vast majority of iterations some integers will
be repeated at least once. These integers index the elements that we draw from
$\{w_1, w_2, \ldots, w_N\}$; call these $\{w_1^{(b)}, w_2^{(b)}, \ldots, w_N^{(b)}\}$. Next, we use this bootstrap sample
to obtain the M-estimate $\hat{\theta}^{(b)}$ by solving

$$\min_{\theta \in \Theta} \sum_{i=1}^{N} q(w_i^{(b)}, \theta).$$


We iterate the process $B$ times, obtaining $\hat{\theta}^{(b)}$, $b = 1, \ldots, B$. These estimates can now
be used as in a Monte Carlo simulation. Computing the average of the $\hat{\theta}^{(b)}$, say $\bar{\hat{\theta}}$,
allows us to estimate the bias in $\hat{\theta}$, called the bootstrap bias estimate. The sample
variance, $(B-1)^{-1}\sum_{b=1}^{B} [\hat{\theta}^{(b)} - \bar{\hat{\theta}}][\hat{\theta}^{(b)} - \bar{\hat{\theta}}]'$, called the bootstrap variance estimate,
can be used to obtain standard errors for the $\hat{\theta}_j$, the estimates from the original
sample. For a scalar estimate $\hat{\gamma}$, we obtain its bootstrap standard error as $\mathrm{se}_B(\hat{\gamma}) =
[(B-1)^{-1}\sum_{b=1}^{B} (\hat{\gamma}^{(b)} - \bar{\hat{\gamma}})^2]^{1/2}$. Naturally, we can apply this formula to smooth
functions of the original parameter estimates $\hat{\theta}$, say $\hat{\gamma} = g(\hat{\theta})$ for a continuously
differentiable function $g: \mathbb{R}^P \to \mathbb{R}$. We can then use $\mathrm{se}_B(\hat{\gamma})$ to construct asymptotic
hypothesis tests and confidence intervals for $\gamma_o$.

Especially for computing average partial effects (a topic that will arise repeatedly
in Part IV), we often need to estimate a parameter that can be written as $\gamma_o =
E[g(w_i, \theta_o)]$. A natural, consistent estimator is $\hat{\gamma} = N^{-1}\sum_{i=1}^{N} g(w_i, \hat{\theta})$. To estimate its
asymptotic variance, we must account for the randomness in $w_i$ as well as $\hat{\theta}$. As
before, we draw bootstrap samples, and, for bootstrap sample $b$, the estimate of $\gamma_o$ is

$$\hat{\gamma}^{(b)} = N^{-1}\sum_{i=1}^{N} g(w_i^{(b)}, \hat{\theta}^{(b)}).$$

Once we have the collection of bootstrap estimates $\{\hat{\gamma}^{(b)} : b = 1, 2, \ldots, B\}$, we can
compute the bootstrap bias and, more important, the bootstrap standard error using
the previous formulas.
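
The resampling steps for the nonparametric bootstrap can be sketched as follows. This is a hedged illustration, not a general-purpose routine: the helper names `nonparametric_bootstrap` and `ols_coefs`, the OLS estimator used in the demonstration, and the simulated data are assumptions. The bias estimate is computed as the average of the bootstrap estimates minus the original estimate.

```python
import numpy as np


def nonparametric_bootstrap(w, estimator, n_boot=999, seed=0):
    """Resample rows of w with replacement and re-estimate on each bootstrap sample."""
    rng = np.random.default_rng(seed)
    n = w.shape[0]
    theta_hat = np.atleast_1d(estimator(w)).astype(float)
    draws = np.empty((n_boot, theta_hat.size))
    for b in range(n_boot):
        idx = rng.integers(0, n, size=n)    # N row indexes, drawn with replacement
        draws[b] = estimator(w[idx])
    center = draws.mean(axis=0)             # average of the bootstrap estimates
    dev = draws - center
    variance = dev.T @ dev / (n_boot - 1)   # bootstrap variance estimate
    bias = center - theta_hat               # bootstrap bias estimate
    return theta_hat, bias, variance


def ols_coefs(w):
    """Assumed estimator for the demo: OLS of column 1 on a constant and column 0."""
    X = np.column_stack([np.ones(w.shape[0]), w[:, 0]])
    coefs, *_ = np.linalg.lstsq(X, w[:, 1], rcond=None)
    return coefs


# Illustrative data (assumed design)
rng = np.random.default_rng(2)
x = rng.standard_normal(300)
y = 1.0 + 0.5 * x + rng.standard_normal(300)
w = np.column_stack([x, y])

theta_hat, bias, variance = nonparametric_bootstrap(w, ols_coefs)
se_boot = np.sqrt(np.diag(variance))
print(theta_hat, bias, se_boot)
```

Because `estimator` is an arbitrary function of the resampled rows, the same routine covers a smooth function $g(\hat{\theta})$ or an average such as $N^{-1}\sum_i g(w_i, \hat{\theta})$: the function simply returns that quantity instead of the raw coefficients.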

Using the bootstrap standard error to construct statistics and confidence intervals 
is often much easier than the analytical calculations needed to obtain an asymptotic 
standard error based on first-order asymptotics. But using the bootstrap standard 
error to construct test statistics cannot be shown to improve on the approximation 
provided by the usual asymptotic theory. As it turns out, in many cases the bootstrap 
does improve the approximation of the distribution of test statistics. In other words, 
the bootstrap can provide an asymptotic refinement compared with the usual asymp- 
totic theory, but one must use some care in computing the bootstrap test statistics. 

To show that the bootstrap approximation of a distribution converges more 
quickly than the usual rates associated with first-order asymptotics, the notion of an 
asymptotically pivotal statistic is critical. An asymptotically pivotal statistic is one 
whose limiting distribution does not depend on unknown parameters. Asymptotic t 
statistics, Wald statistics, score statistics, and quasi-LR statistics are all asymptoti- 
cally pivotal when they converge to the standard normal distribution, in the case of a 
t statistic, and to the chi-square distribution, in the case of the other statistics. One 
must sometimes use care, though, to ensure a statistic is asymptotically pivotal. For 
example, for a $t$ statistic to be asymptotically pivotal in the context of nonlinear
regression with heteroskedasticity, we must use a heteroskedasticity-robust statistic.
The Wald and score statistics must often use robust asymptotic variance estimators 
to deliver an asymptotic chi-square distribution. The quasi-LR statistic is guaranteed 
to be asymptotically pivotal only when the generalized information matrix equality 
holds. 

To explain how to bootstrap the critical values for a $t$ statistic, consider testing
$H_0: \theta_o = c$ for some known value $c$. The $t$ statistic, $t = (\hat{\theta} - c)/\mathrm{se}(\hat{\theta})$, is asymptotically
pivotal if $\mathrm{se}(\hat{\theta})$ is appropriately chosen. To obtain a refinement using the bootstrap,
we must obtain the empirical distribution of the statistic

$$t^{(b)} = (\hat{\theta}^{(b)} - \hat{\theta})/\mathrm{se}(\hat{\theta}^{(b)}),$$

where $\hat{\theta}$ is the estimate from the original sample, $\hat{\theta}^{(b)}$ is the estimate for bootstrap
sample $b$, and $\mathrm{se}(\hat{\theta}^{(b)})$ is the standard error estimated from the same bootstrap
sample. (So, for example, $\mathrm{se}(\hat{\theta}^{(b)})$ could be a heteroskedasticity-robust standard error for
NLS.) Notice how the $t$ statistic for each bootstrap replication is centered at the
original estimate, $\hat{\theta}$, not the hypothesized value. As discussed by Horowitz (2001),
centering at the estimate is required to ensure asymptotic refinements of the testing
procedure.

The way we obtain bootstrap critical values for a test depends on the nature of the
alternative. For a one-sided alternative, say $H_1: \theta_o > c$, we order the statistics
$\{t^{(b)} : b = 1, 2, \ldots, B\}$ from smallest to largest, and we pick the value representing
the desired quantile of the list of ordered values. For example, to obtain a 5 percent
test against a greater than one-sided alternative, we choose the critical value as the
95th percentile in the ordered list of $t^{(b)}$. For a two-sided alternative, we must choose
between a nonsymmetric test and a symmetric test. For the former, a test with size $\alpha$
chooses critical values as the lower and upper $\alpha/2$ quantiles of the ordered bootstrap
test statistics, say $cv_l$ and $cv_u$, and we reject $H_0$ if $t > cv_u$ or $t < cv_l$. For the latter,
we first order the absolute values of the statistics, $|t^{(b)}|$, and then choose the upper
$\alpha$ quantile as the critical value for a test of size $\alpha$. Naturally, we compare $|t|$ with
the critical value. This approach to choosing critical values from bootstrapping is called
the percentile-t method.

We can use the percentile-t method to compute a bootstrap p-value. For example,
against a greater than one-sided alternative, we simply find the fraction of bootstrap $t$
statistics $t^{(b)}$ that exceed $t$. A symmetric p-value for a two-sided alternative does the
same for $|t^{(b)}|$ and $|t|$.
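
A compact sketch of the percentile-t calculations follows. For simplicity it assumes the parameter is a population mean, estimated by the sample mean with its usual standard error; the function names and the skewed (exponential) example data are assumptions for illustration. Each bootstrap $t$ statistic is centered at the original estimate and uses the standard error recomputed on the same bootstrap sample.

```python
import numpy as np


def mean_and_se(w):
    """Assumed estimator for the demo: sample mean and its usual standard error."""
    return w.mean(), w.std(ddof=1) / np.sqrt(w.size)


def percentile_t(w, c, alpha=0.05, n_boot=999, seed=0):
    """Bootstrap critical values and p-values for H0: theta_o = c."""
    rng = np.random.default_rng(seed)
    n = w.size
    theta_hat, se_hat = mean_and_se(w)
    t_stat = (theta_hat - c) / se_hat
    t_boot = np.empty(n_boot)
    for b in range(n_boot):
        w_b = w[rng.integers(0, n, size=n)]
        theta_b, se_b = mean_and_se(w_b)
        t_boot[b] = (theta_b - theta_hat) / se_b   # centered at theta_hat, not at c
    return {
        "t": t_stat,
        "cv_one_sided": np.quantile(t_boot, 1.0 - alpha),
        "cv_lower": np.quantile(t_boot, alpha / 2.0),
        "cv_upper": np.quantile(t_boot, 1.0 - alpha / 2.0),
        "cv_symmetric": np.quantile(np.abs(t_boot), 1.0 - alpha),
        "p_one_sided": np.mean(t_boot > t_stat),
        "p_symmetric": np.mean(np.abs(t_boot) > abs(t_stat)),
    }


# Illustrative data (assumed design): skewed population with mean 1
rng = np.random.default_rng(3)
w = rng.exponential(1.0, size=50)
out = percentile_t(w, c=1.0)
print(out)
```

With skewed data and a small sample, the nonsymmetric critical values `cv_lower` and `cv_upper` will generally differ in absolute value, which is the kind of feature the standard normal approximation cannot capture.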

Testing multiple hypotheses is similar. Suppose that for a $Q$-vector $\phi_o$, we want to
test $H_0: \phi_o = r$, where $r$ is a vector of known constants. The Wald statistic computed
using the original sample is $W = (\hat{\phi} - r)'\hat{V}^{-1}(\hat{\phi} - r)$. We compute a series of Wald
statistics from bootstrap samples as

$$W^{(b)} = (\hat{\phi}^{(b)} - \hat{\phi})'[\hat{V}^{(b)}]^{-1}(\hat{\phi}^{(b)} - \hat{\phi}), \qquad b = 1, \ldots, B,$$

where we must take care so that the calculation of $\hat{V}$ (and $\hat{V}^{(b)}$) delivers an
asymptotic chi-square statistic. The bootstrap p-value is the fraction of $W^{(b)}$ that exceed $W$.

The parametric bootstrap is even more similar to a standard Monte Carlo
simulation because we assume that the distribution of $w$ is known up to the parameters $\theta_o$.
Let $f(\cdot, \theta)$ denote the parametric density. Then, on each bootstrap iteration, we draw
a random sample of size $N$ from $f(\cdot, \hat{\theta})$; this gives $\{w_1^{(b)}, w_2^{(b)}, \ldots, w_N^{(b)}\}$, and the rest
of the calculations are the same as in the nonparametric bootstrap. (With the
parametric bootstrap, when $f(\cdot, \theta)$ is a continuous density, only rarely would we find
repeated values among the $w_i^{(b)}$.)

When $w_i$ is partitioned into $(x_i, y_i)$, where the $x_i$ are conditioning variables, other
resampling schemes are sometimes preferred. In Chapter 13, we study the method of
conditional maximum likelihood, where we assume that a model of a conditional
density, $f(y \mid x; \theta)$, is correctly specified. In that case, we can apply a combination of
the nonparametric and parametric bootstrap. Because the distribution of $x_i$ is
unspecified, we randomly draw $N$ indexes from $\{1, 2, \ldots, N\}$ to obtain a
nonparametric bootstrap sample for the conditioning variables, $\{x_i^{(b)} : i = 1, \ldots, N\}$.
Then, given the estimate $\hat{\theta}$, we obtain $y_i^{(b)}$ by drawing from the density $f(\cdot \mid x_i^{(b)}; \hat{\theta})$.
We then use the bootstrap samples $(x_i^{(b)}, y_i^{(b)})$ as before. Compared with the fully
nonparametric bootstrap where we resample the entire vector $w_i = (x_i, y_i)$ from the
original data, the method that draws from $f(\cdot \mid x_i^{(b)}; \hat{\theta})$ is not as widely applicable and
is computationally more expensive.

In cases where we have even more structure, other alternatives are available. For
example, in a nonlinear regression model with $y_i = m(x_i, \theta_o) + u_i$, where the error $u_i$
is independent of $x_i$, we first compute the NLS estimate $\hat{\theta}$ and the NLS residuals,
$\hat{u}_i = y_i - m(x_i, \hat{\theta})$, $i = 1, 2, \ldots, N$. Then, using the procedure described for the
nonparametric bootstrap, a bootstrap sample of residuals, $\{\hat{u}_i^{(b)} : i = 1, 2, \ldots, N\}$,
is obtained, and we compute $y_i^{(b)} = m(x_i, \hat{\theta}) + \hat{u}_i^{(b)}$. Using the generated data
$\{(x_i, y_i^{(b)}) : i = 1, 2, \ldots, N\}$, we compute the NLS estimate, $\hat{\theta}^{(b)}$. This procedure is
called the nonparametric residual bootstrap. (We resample the residuals and use these
to generate a sample on the dependent variable, but we do not resample the
conditioning variables, $x_i$.) If the model is nonlinear in $\theta$, this method can be
computationally demanding because we want $B$ to be several hundred, if not several
thousand. Nonetheless, such procedures are becoming more and more feasible as
computational speed increases. When $u_i$ has zero conditional mean [$E(u_i \mid x_i) = 0$] but
is heteroskedastic [$\mathrm{Var}(u_i \mid x_i)$ depends on $x_i$], alternative sampling methods, in
particular the wild bootstrap, can be used to obtain heteroskedasticity-consistent standard
errors. See, for example, Horowitz (2001).
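
The nonparametric residual bootstrap can be sketched as below. To keep the example short, the mean function is assumed to be linear in the parameters, so each re-estimation is an OLS fit; with a genuinely nonlinear $m(x, \theta)$ the call to `ols` would be replaced by an iterative NLS routine. The function names and simulated data are illustrative assumptions.

```python
import numpy as np


def ols(X, y):
    """Least squares fit; stands in for the NLS step in this linear illustration."""
    coefs, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coefs


def residual_bootstrap(X, y, n_boot=999, seed=0):
    """Resample residuals, rebuild y with the regressors held fixed, and re-estimate."""
    rng = np.random.default_rng(seed)
    n = y.shape[0]
    theta_hat = ols(X, y)
    fitted = X @ theta_hat
    resid = y - fitted
    draws = np.empty((n_boot, theta_hat.size))
    for b in range(n_boot):
        u_b = resid[rng.integers(0, n, size=n)]   # bootstrap sample of residuals
        y_b = fitted + u_b                         # x_i are not resampled
        draws[b] = ols(X, y_b)
    return theta_hat, draws.std(axis=0, ddof=1)


# Illustrative data (assumed design) with errors independent of x
rng = np.random.default_rng(4)
x = rng.uniform(0.0, 1.0, size=200)
X = np.column_stack([np.ones(200), x])
y = 1.0 + 2.0 * x + 0.5 * rng.standard_normal(200)
theta_hat, se_resid_boot = residual_bootstrap(X, y)
print(theta_hat, se_resid_boot)
```

Because the scheme relies on the errors being independent of $x_i$, it should not be expected to give valid standard errors under heteroskedasticity.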

Bootstrapping is easily applied to the kinds of panel data structures that we treat
in this text because our sampling assumption for panel data is random sampling in
the cross section dimension. For panel data, we simply let $w_i = (w_{i1}, \ldots, w_{iT})$ denote
the outcomes across all $T$ time periods for cross section observation $i$. When we
obtain a bootstrap sample, all time periods for a particular unit constitute a single
observation. In other words, as with pure cross section applications, we randomly
select $N$ integers (with replacement) from $\{1, 2, \ldots, N\}$ and obtain the bootstrap
sample $w_i^{(b)} = (w_{i1}^{(b)}, \ldots, w_{iT}^{(b)})$. For emphasis: we do not resample separate time periods
within a unit; we only resample units. As with the first-order asymptotic theory based
on random samples, this kind of bootstrapping for panel data is realistic for large
cross sections and relatively few time periods.
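
The following sketch shows unit-level resampling for a balanced panel stored as $N \times T$ arrays. The pooled OLS slope estimator, the function names, and the simulated data with unit-specific components are assumptions made for illustration.

```python
import numpy as np


def pooled_ols_slope(y, x):
    """Assumed estimator for the demo: pooled OLS slope of y on x (with an intercept)."""
    xd = x.ravel() - x.mean()
    return (xd @ y.ravel()) / (xd @ xd)


def draw_units(rng, y, x):
    """Resample whole cross section units; each drawn row keeps all T time periods."""
    n = y.shape[0]
    idx = rng.integers(0, n, size=n)   # N unit indexes, drawn with replacement
    return idx, y[idx], x[idx]


def panel_bootstrap_se(y, x, n_boot=499, seed=0):
    rng = np.random.default_rng(seed)
    draws = np.empty(n_boot)
    for b in range(n_boot):
        _, y_b, x_b = draw_units(rng, y, x)
        draws[b] = pooled_ols_slope(y_b, x_b)
    return draws.std(ddof=1)


# Illustrative balanced panel (assumed design): N = 200 units, T = 5 periods,
# with unit-specific components in both the regressor and the error
rng = np.random.default_rng(5)
n_units, n_periods = 200, 5
x = rng.standard_normal((n_units, 1)) + rng.standard_normal((n_units, n_periods))
c = rng.standard_normal((n_units, 1))
y = 1.0 + 0.5 * x + c + rng.standard_normal((n_units, n_periods))
se_unit = panel_bootstrap_se(y, x)
print(se_unit)
```

Indexing rows with `y[idx]` is what keeps each unit's time series intact; resampling the individual $(i, t)$ cells instead would ignore the within-unit dependence.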


12.9 Multivariate Nonlinear Regression Methods 


As with linear regression methods, nonlinear regression methods can be extended to
systems of equations. For example, suppose that $y_1$ and $y_2$ are fractional variables
(say, shares of pension investments put into stocks and bonds, respectively, with a
third “other” category) and we wish to account for the bounded nature of these
responses in our model. We might specify logistic regression functions of the form

$$E(y_g \mid x) = \exp(x\theta_{og})/[1 + \exp(x\theta_{og})], \qquad g = 1, 2,$$

where $x$ is a row vector of common explanatory variables. We can estimate $\theta_{o1}$ and $\theta_{o2}$
individually, using NLS or WNLS, but we can also estimate them jointly. Not
surprisingly, under certain assumptions, a multivariate procedure that accounts for
correlation in the unobservables across equations is more efficient than single equation
estimation, as we discuss in Section 12.9.2.

We can also apply nonlinear regression to panel data structures. For example, for
a nonnegative response variable $y_t$, a dynamic model is

$$E(y_t \mid z_t, y_{t-1}) = \exp(z_t\beta + \alpha y_{t-1}), \qquad t = 1, \ldots, T.$$

One way to estimate the parameters is by pooled NLS. As we show in the next two
subsections, we can cast both system and panel data problems in one framework.


12.9.1 Multivariate Nonlinear Least Squares 


A general setup for multivariate nonlinear least squares (MNLS) is to let $y_g$ be the
response variable for equation $g$, which could be a time period. The corresponding
explanatory variables are $x_g$. We assume the model for $E(y_g \mid x_g)$ is correctly specified:

$$E(y_g \mid x_g) = m_g(x_g, \theta_{og}), \qquad g = 1, \ldots, G, \qquad (12.90)$$

where the parameters in different equations can be distinct or there can be restrictions
across $g$. Given $N$ randomly sampled observations, it should be no surprise that the
$\theta_{og}$ can be consistently estimated by solving

$$\min_{\theta} \sum_{i=1}^{N} \sum_{g=1}^{G} [y_{ig} - m_g(x_{ig}, \theta_g)]^2$$

or, in vector form,

$$\min_{\theta} \sum_{i=1}^{N} [y_i - m(x_i, \theta)]'[y_i - m(x_i, \theta)], \qquad (12.91)$$

where $\theta$ denotes the vector of all parameters, $y_i$ is the $G \times 1$ vector of responses for
observation $i$, and $m(x_i, \theta)$ is the $G \times 1$ vector of conditional mean functions. We call
the solution to (12.91) the multivariate nonlinear least squares estimator. As with
univariate NLS, identification requires that $\theta_o$ is the only solution to the
corresponding population problem, and we would assume mean functions twice
continuously differentiable in the parameters. Generally, the assumptions are as weak as
in the univariate case.

If there are G separate parameter vectors without restrictions across g, then the 
solutions to (12.91) are the same as NLS on each equation. Sometimes, for example 
with cost share equations derived from the theory of the firm, the parameters will be 
restricted across equations. Then the MNLS estimator can be used to estimate the 
parameters with the restrictions imposed. 

In the panel data case, where the conditional mean model is written for a common
set of parameters,

$$E(y_{it} \mid x_{it}) = m(x_{it}, \theta_o), \qquad (12.92)$$

the MNLS estimator is the pooled nonlinear least squares (PNLS) estimator. Problem
12.6 asks you to study this estimator in more detail. Consistency and $\sqrt{N}$-asymptotic
normality of the PNLS estimator follow under general assumptions on the
smoothness of the mean function, but generally one should use a fully robust variance matrix
estimator to account for possible heteroskedasticity and serial correlation. If the
conditional mean model is dynamically complete (see Problem 12.6), then one need
not worry about serial correlation in the errors when estimating the variance matrix
of the PNLS estimator. But heteroskedasticity could still be an issue. If one adds the
homoskedasticity assumption $\mathrm{Var}(y_{it} \mid x_{it}) = \sigma_o^2$ to the appropriate no serial
correlation assumption, the usual statistics from the pooled NLS analysis are asymptotically
valid.

The PNLS estimator is attractive when one wishes only to impose (12.92), without 
making the stronger assumption that the covariates are strictly exogenous. Even
under strict exogeneity, PNLS is usually needed as a first step in obtaining an
asymptotically more efficient estimator. We study that possibility next.


12.9.2 Weighted Multivariate Nonlinear Least Squares 


Under certain assumptions, we can use generalized least squares (GLS) methods to 
more efficiently estimate the parameters appearing in a set of conditional mean 
functions. Here we focus on the case where the explanatory variables are strictly ex- 
ogenous. This often applies to systems of equations and sometimes applies to panel 
data methods. Specifically, we assume 


$$E(y_i \mid x_i) = m(x_i, \theta_o), \qquad \text{some } \theta_o \in \Theta \subset \mathbb{R}^P, \qquad (12.93)$$

where again, $y_i$ is a $G \times 1$ vector on the dependent variable and $m(x_i, \theta)$ is a $G \times 1$
vector of conditional mean functions. It is important to note that the entire vector of
covariates, $x_i$, is conditioned on in (12.93). This is the sense in which the explanatory
variables are strictly exogenous: if some elements of $x_i$ are omitted from one of the
functions $m_g(x_i, \theta_o)$, then they should have no partial effect on $E(y_{ig} \mid x_i)$. As we
discussed in Chapter 7, this is often intended in seemingly unrelated regressions
(SUR)-type applications; in fact, often the same set of explanatory variables appears in each
equation. But assumption (12.93) is violated for panel data models with lagged
dependent variables and, as we saw in Chapter 7 for linear models, perhaps in models
without lagged dependent variables. Nevertheless, we have seen linear models where
strict exogeneity holds after including the time average of the covariates, and we will
see this again in later chapters, including Chapters 15, 16, 17, and 18 for nonlinear
models.

The most general estimator we consider here is the weighted multivariate nonlinear
least squares (WMNLS) estimator. We can easily motivate WMNLS. Let $W(x_i, \gamma)$ be
a model for the $G \times G$ conditional variance matrix $\mathrm{Var}(y_i \mid x_i)$. If this model is
correctly specified, then, generally, we can obtain a $\sqrt{N}$-consistent estimator of the
“true” parameters in the variance matrix, $\gamma_o$. In fact, we would probably use the
residuals from an initial MNLS estimation (more on this shortly). Given $\hat{\gamma}$, and
assuming that $W(x_i, \hat{\gamma})$ is nonsingular for all $i$ (often true by construction), we can
estimate $\theta_o$ by solving

$$\min_{\theta} \sum_{i=1}^{N} [y_i - m(x_i, \theta)]'[W(x_i, \hat{\gamma})]^{-1}[y_i - m(x_i, \theta)]. \qquad (12.94)$$


The solution to (12.94) is the WMNLS estimator.

As we will see in later chapters, the WMNLS estimator can be attractive for a
broad class of nonlinear models, particularly when we cannot, or do not wish to,
specify a complete distribution $D(y_i \mid x_i)$. Of importance, under (12.93), the WMNLS
estimator is generally consistent for $\theta_o$ even if the inverse of the weighting matrix is
misspecified for $\mathrm{Var}(y_i \mid x_i)$. The intuition for the robustness of the WMNLS estimator is
straightforward from the general M-estimation theory. Let $\gamma^*$ be the probability limit
of $\hat{\gamma}$, whether or not the variance is misspecified. Then the WMNLS estimator is
consistent for $\theta_o$ if $\theta_o$ uniquely solves the population problem

$$\min_{\theta} E\{[y_i - m(x_i, \theta)]'[W(x_i, \gamma^*)]^{-1}[y_i - m(x_i, \theta)]\}.$$

But straightforward algebra shows that

$$
\begin{aligned}
[y_i - m(x_i, \theta)]'[W(x_i, \gamma^*)]^{-1}[y_i - m(x_i, \theta)]
&= [y_i - m(x_i, \theta_o)]'[W(x_i, \gamma^*)]^{-1}[y_i - m(x_i, \theta_o)] \\
&\quad + 2[y_i - m(x_i, \theta_o)]'[W(x_i, \gamma^*)]^{-1}[m(x_i, \theta_o) - m(x_i, \theta)] \\
&\quad + [m(x_i, \theta_o) - m(x_i, \theta)]'[W(x_i, \gamma^*)]^{-1}[m(x_i, \theta_o) - m(x_i, \theta)].
\end{aligned}
$$

By the law of iterated expectations, the term in the middle has zero conditional
mean by (12.93), and so

$$
\begin{aligned}
E\{[y_i - m(x_i, \theta)]'[W(x_i, \gamma^*)]^{-1}[y_i - m(x_i, \theta)]\}
&= E\{[y_i - m(x_i, \theta_o)]'[W(x_i, \gamma^*)]^{-1}[y_i - m(x_i, \theta_o)]\} \\
&\quad + E\{[m(x_i, \theta_o) - m(x_i, \theta)]'[W(x_i, \gamma^*)]^{-1}[m(x_i, \theta_o) - m(x_i, \theta)]\}. \qquad (12.95)
\end{aligned}
$$

The first term on the right-hand side in (12.95) does not depend on $\theta$ and the second
term is zero when $\theta = \theta_o$. As always, identification requires that the latter term be
zero only when $\theta = \theta_o$.

Not surprisingly, the WMNLS estimator has desirable asymptotic efficiency
properties if (12.93) holds, along with

$$\mathrm{Var}(y_i \mid x_i) = W(x_i, \gamma_o) \quad \text{for some } \gamma_o. \qquad (12.96)$$

Under these assumptions, it is easy to show that the conditional information matrix
equality holds. Matrix algebra can be used to show directly that
$\mathrm{Avar}\,\sqrt{N}(\hat{\theta} - \theta_o)$ under correct second moment specification is
smaller than that of any other WMNLS estimator using weights that are a function
of $x_i$ (including, of course, constant weights). In particular, the WMNLS estimator is
more efficient than MNLS. But remember that MNLS does not require the strict
exogeneity assumption for consistency. In Chapter 14, we will develop a general
efficiency framework that allows us to conclude that WMNLS is the asymptotically
efficient estimator in a broad class of instrumental variables estimators.

Even if we admit the possibility that our model $W(x_i, \gamma)$ is misspecified for the
conditional variance, there are still good reasons to apply WMNLS, at least under
(12.93). As in the linear case, a misspecified model of the variance matrix that
nevertheless captures key features of the conditional second moments will often
lead to a more efficient estimator of $\theta_o$ than an estimator that ignores
variances and covariances: the MNLS estimator. This is the key insight in the
generalized estimating equation (GEE) literature in statistics (see, for example, Liang
and Zeger, 1986), which is typically applied to panel data sets (and cluster samples)
but whose insights also apply to SUR-like systems of equations. Borrowing from
GEE nomenclature, we refer to $W(x_i, \gamma)$ as a working variance matrix, which is
allowed, and in many cases is known, to be misspecified. We discuss some ways of
choosing this matrix below. The GEE approach is closely related to quasi-maximum
likelihood methods, and we cover these, along with GEE for panel data, in Chapter
13.

Given $\hat{\gamma}$ from a first-step estimation, the first-order conditions for the WMNLS
estimator, $\hat{\theta}$, are

$$\sum_{i=1}^{N} \nabla_\theta m(x_i, \hat{\theta})'[W(x_i, \hat{\gamma})]^{-1}[y_i - m(x_i, \hat{\theta})] = 0. \qquad (12.97)$$

(The GEE approach works off a very similar set of moment conditions, in which the
first occurrence of $\hat{\theta}$ is replaced with an initial, $\sqrt{N}$-consistent estimator of $\theta_o$,
and the GEE estimator is $\sqrt{N}$-equivalent to the WMNLS estimator. Here we focus on
equation (12.97).)

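With a possibly misspecified conditional variance matrix, the asymptotic variance
of $\hat{\theta}$ should be estimated using a Huber-White sandwich form,

$$
\begin{aligned}
&\left(\sum_{i=1}^{N} \nabla_\theta m_i(\hat{\theta})'[W_i(\hat{\gamma})]^{-1}\nabla_\theta m_i(\hat{\theta})\right)^{-1}
\left(\sum_{i=1}^{N} \nabla_\theta m_i(\hat{\theta})'[W_i(\hat{\gamma})]^{-1}\hat{u}_i\hat{u}_i'[W_i(\hat{\gamma})]^{-1}\nabla_\theta m_i(\hat{\theta})\right) \\
&\qquad \cdot \left(\sum_{i=1}^{N} \nabla_\theta m_i(\hat{\theta})'[W_i(\hat{\gamma})]^{-1}\nabla_\theta m_i(\hat{\theta})\right)^{-1}, \qquad (12.98)
\end{aligned}
$$

where $m_i(\hat{\theta}) = m(x_i, \hat{\theta})$, $W_i(\hat{\gamma}) = W(x_i, \hat{\gamma})$, and $\hat{u}_i = y_i - m_i(\hat{\theta})$ are the WMNLS
residuals. (As usual, saying that (12.98) is valid for $\mathrm{Avar}(\hat{\theta})$
means that $\mathrm{Avar}\,\sqrt{N}(\hat{\theta} - \theta_o)$ is consistently estimated by multiplying (12.98) by $N$.)
Problem 12.11 asks you to derive this formula, along with the simplified formula
(with the final two terms in (12.98) dropped) that is valid under $\mathrm{Var}(y_i \mid x_i) =
W(x_i, \gamma_o)$ for some $\gamma_o$.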
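
As a rough sketch of how the sandwich form (12.98) might be computed, the function below takes the gradients, working variance matrices, and residuals as arrays that are assumed to have been evaluated already at $\hat{\theta}$ and $\hat{\gamma}$. The function name, the array layout, and the small heteroskedastic linear example are assumptions for illustration.

```python
import numpy as np


def wmnls_sandwich(grad, W, resid):
    """
    Sandwich estimate of Avar(theta_hat).

    grad  : (N, G, P) array, gradient of m(x_i, theta) evaluated at theta_hat
    W     : (N, G, G) array, working variance matrices W_i(gamma_hat)
    resid : (N, G) array, residuals y_i - m(x_i, theta_hat)
    """
    Winv_grad = np.linalg.solve(W, grad)                 # W_i^{-1} grad_i for each i
    A = np.einsum("igp,igq->pq", grad, Winv_grad)        # sum_i grad_i' W_i^{-1} grad_i
    # Row i holds grad_i' W_i^{-1} u_i (uses the symmetry of W_i)
    score = np.einsum("igp,ig->ip", Winv_grad, resid)
    B = score.T @ score                                  # sum_i score_i score_i'
    A_inv = np.linalg.inv(A)
    return A_inv @ B @ A_inv


# Illustrative use (assumed design): G = 1, linear mean, unit working variance
rng = np.random.default_rng(6)
n = 100
X = np.column_stack([np.ones(n), rng.standard_normal(n)])
y = X @ np.array([1.0, 2.0]) + rng.standard_normal(n) * (1.0 + np.abs(X[:, 1]))
b = np.linalg.lstsq(X, y, rcond=None)[0]
u = y - X @ b
V = wmnls_sandwich(X[:, None, :], np.ones((n, 1, 1)), u[:, None])
print(np.sqrt(np.diag(V)))
```

In this scalar, unweighted linear case the result coincides with the familiar heteroskedasticity-robust variance matrix for OLS, which gives a simple check on the implementation.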

When $W(x_i, \gamma)$ is chosen to be a diagonal matrix, the WMNLS estimator is
typically robust to violations of the strict exogeneity assumption. For example, in a panel
data setting where $E(y_{it} \mid x_{it}) = m_t(x_{it}, \theta_o)$, we can choose $W(x_i, \gamma)$ to be diagonal
where the $t$th diagonal element depends only on $x_{it}$. The resulting estimator is a pooled
weighted nonlinear least squares (PWNLS) estimator, which is often useful for
dynamic panel data models, or other panel data models without strict exogeneity; see
Problem 12.13. In Chapters 13 and 18 we will cover quasi-maximum likelihood
estimators that are asymptotically equivalent to PWNLS but can be obtained more
easily via one-step estimation.

Sometimes we might choose $W(x_i, \gamma) = \Omega$, that is, use a matrix where the variances
and covariances do not depend on $x_i$. A consistent estimator of the unconditional
variance matrix of $u_i = y_i - m(x_i, \theta_o)$ is

$$\hat{\Omega} = N^{-1}\sum_{i=1}^{N} \breve{u}_i\breve{u}_i', \qquad (12.99)$$

where the $\breve{u}_i$ are the vectors of MNLS residuals. When we use $\hat{\Omega}$ in (12.99), we have
what is sometimes called the nonlinear SUR estimator. Equation (12.99) places no
restrictions on the unconditional variances and covariances. In panel data cases we
might, for example, restrict $\Omega$ to have a random effects or an AR(1) structure. The
asymptotic analysis of the estimator with constant $W(x_i, \gamma)$ follows from the general
WMNLS framework. In particular, the nonlinear SUR estimator is consistent and
$\sqrt{N}$-asymptotically normal under (12.93) whether or not $\mathrm{Var}(y_i \mid x_i)$ is constant. Of
course, if $\mathrm{Var}(y_i \mid x_i)$ depends on $x_i$, there could be an alternative WMNLS estimator
that is more efficient than the nonlinear SUR estimator, but that hinges on our ability
to find $W(x_i, \gamma)$ that provides a “better” approximation to $\mathrm{Var}(y_i \mid x_i)$ than a constant
matrix.

Models for conditional variances $\mathrm{Var}(y_{ig} \mid x_i)$ are relatively straightforward
because we have many univariate distributions to draw on (see Section 13.11.3).
Directly specifying models for conditional covariances, when they actually
depend on $x_i$, is more difficult. A fruitful approach is to specify conditional
variances for each $g$ but then to nominally assume constant conditional
correlations. Then, if $\rho_{gh}$ is (nominally) $\mathrm{Corr}(y_{ig}, y_{ih} \mid x_i)$, we have $\mathrm{Cov}(y_{ig}, y_{ih} \mid x_i) =
\rho_{gh}[\mathrm{Var}(y_{ig} \mid x_i)\mathrm{Var}(y_{ih} \mid x_i)]^{1/2}$. Let $V(x_i, \alpha)$ be the $G \times G$ diagonal matrix with the
proposed variances down its diagonal, and let $R(\rho)$ denote the $G \times G$ matrix of
proposed constant correlations, which depends on the $J$-vector of parameters $\rho$. In the
GEE literature, $R(\rho)$ is called a working correlation matrix because there is no
presumption that it truly contains the conditional correlations. In fact, there is no
presumption that the conditional correlations are even constant.

Given $V(x_i, \alpha)$ and $R(\rho)$, we can write the working variance matrix as

$$W(x_i, \gamma) = V(x_i, \alpha)^{1/2}R(\rho)V(x_i, \alpha)^{1/2}, \qquad (12.100)$$

where $V(x_i, \alpha)^{1/2}$ is the matrix square root. Note that $\gamma$ contains $\alpha$ and $\rho$.

To implement WMNLS (GEE) under (12.100), we need to estimate $\alpha$ and $\rho$. This
typically proceeds by MNLS to first obtain residuals $\breve{u}_{ig}$, then by using the $\breve{u}_{ig}$ and the
variance models, say $v_g(x_i, \alpha)$, to estimate the variances, say $\hat{v}_{ig} = v_g(x_i, \hat{\alpha})$. (In many
cases, the $v_g(x_i, \alpha)$ functions depend on the mean parameters $\theta$ and a single additional
parameter; we will see this explicitly in Section 13.11.3 and Chapter 18.) We can then
use the standardized residuals $\breve{u}_{ig}/\sqrt{\hat{v}_{ig}}$ for all $i$ and $g$ to estimate the parameters in
the working correlation matrix $R(\rho)$. The details depend on how $R(\rho)$ is specified.

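The most general specification (once we restrict ourselves to a matrix that does not
depend on $x_i$) is an unstructured working correlation matrix, where the elements of
$R(\rho)$ are unrestricted (except, of course, for requiring the matrix to be a valid
correlation matrix):

$$R(\rho) = \begin{pmatrix} 1 & \rho_{12} & \rho_{13} & \cdots & \rho_{1G} \\ \rho_{12} & 1 & \rho_{23} & \cdots & \rho_{2G} \\ \vdots & & \ddots & & \vdots \\ \rho_{1,G-1} & \cdots & & 1 & \rho_{G-1,G} \\ \rho_{1G} & \rho_{2G} & \cdots & \rho_{G-1,G} & 1 \end{pmatrix}.$$ (12.101)

Then, we can estimate each $\rho_{gh}$ as

$$\hat{\rho}_{gh} = \text{Sample Correlation}\left(\check{u}_{ig}/\sqrt{\hat{v}_{ig}},\ \check{u}_{ih}/\sqrt{\hat{v}_{ih}}\right).$$ (12.102)

Under general conditions, the $\hat{\rho}_{gh}$ converge to, say, $\rho^*_{gh}$, the population correlation
between $u_{ig}/\sqrt{v_g(x_i, \omega^*)}$ and $u_{ih}/\sqrt{v_h(x_i, \omega^*)}$, where $\omega^* = \mathrm{plim}(\hat{\omega})$.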

A common form of $R(\rho)$ in panel data settings is an exchangeable working correlation
matrix, which introduces a single correlation parameter, $\rho$, for all pairs. Then,
$\hat{\rho}$ is obtained by averaging the $\hat{\rho}_{gh}$ (but where $g$ and $h$ denote different time periods)
across all nonredundant pairs with $g \neq h$. An exchangeable correlation matrix allows
for a random effects-type correlation structure but with conditional variances
changing over time. We will have more to say on this setup in Chapter 13.

Once $\hat{\omega}$ and $\hat{\rho}$ have been obtained, the working variance matrix estimates,
$W_i(\hat{\gamma}) = V(x_i, \hat{\omega})^{1/2} R(\hat{\rho}) V(x_i, \hat{\omega})^{1/2}$, are easy to obtain, and they can be used in
WMNLS. In special cases it may be reasonable to assume $W(x_i, \gamma)$ is correctly specified
for $\mathrm{Var}(y_i \mid x_i)$, but most of the time one should use equation (12.98) as the
variance matrix estimator.
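
The construction of the working variance matrix can be sketched numerically. The Python fragment below is only an illustration under stated assumptions: the function names `working_correlation` and `working_variance` are hypothetical, the standardized residuals are taken as given in an N x G array, and NumPy's sample correlation stands in for (12.102).

```python
import numpy as np

def working_correlation(std_resid, structure="unstructured"):
    """Estimate R(rho) from an N x G array of standardized residuals."""
    # G x G matrix of pairwise sample correlations, as in (12.102)
    R = np.corrcoef(std_resid, rowvar=False)
    if structure == "exchangeable":
        G = R.shape[0]
        # average over the nonredundant pairs g < h
        rho = R[np.triu_indices(G, k=1)].mean()
        R = np.full((G, G), rho)
        np.fill_diagonal(R, 1.0)
    return R

def working_variance(v_hat_i, R):
    """W_i = V_i^{1/2} R V_i^{1/2} for one unit, where V_i = diag(v_hat_i)."""
    root = np.sqrt(np.asarray(v_hat_i, dtype=float))
    return root[:, None] * R * root[None, :]
```

Because $V$ is diagonal, its matrix square root is just the elementwise square root of the fitted variances, which is why no general matrix square root routine is needed here.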

A final word of caution. The variance matrix estimator (12.98) assumes
that the conditional mean is correctly specified, as stated explicitly in equation
(12.93). Thus, this estimator is fully robust only if the conditional mean is correctly
specified; it is only semirobust if we entertain misspecification of $E(y_i \mid x_i)$.
Unfortunately, because WMNLS is a two-step estimator, a variance matrix estimator that
allows conditional mean misspecification is much more complicated than (12.98)
because it depends on the sampling variability of $\hat{\gamma}$. In particular, it is not enough just to
replace the outer ends of the sandwich with the unconditional Hessian evaluated at
the estimates. (This is why, in the application of WMNLS to GEE, the standard
errors from (12.98) are often explicitly labeled as semirobust.)


12.10 Quantile Estimation 


The introduction to this chapter included a brief discussion of the problem of esti- 
mating a conditional median function, and we discussed how the consistency of LAD 
follows from Theorem 12.2. Estimation of conditional medians and, more generally, 
conditional quantiles, is increasingly popular in empirical research. It has long been
recognized that, while estimating conditional mean functions is valuable, an
explanatory variable can have very different partial effects across different
segments of a population. Quantile estimation allows us to study such effects. Koenker
and Bassett (1978) developed the theory of quantile regression, and Buchinsky (1998), 
Peracchi (2001), and Koenker (2005) provide recent treatments. Chamberlain (1994) 
and Buchinsky (1994) are influential applications of quantile regression to estimate 
changes in the wage distribution in the United States. 


12.10.1 Quantiles, the Estimation Problem, and Consistency 


Let $y_i$ denote a random draw from a population. Then, for $0 < \tau < 1$, $q(\tau)$ is a $\tau$th
quantile of the distribution of $y_i$ if $P(y_i \le q(\tau)) \ge \tau$ and $P(y_i \ge q(\tau)) \ge 1 - \tau$. A
special case is the median when $\tau = 1/2$. For notational convenience, we will write
the $\tau$th quantile of $y_i$ as $\mathrm{Quant}_\tau(y_i)$.

Typically, we are interested in modeling quantiles conditional on a set of covariates
$x_i$. In most applications, the assumption is that the quantiles are linear in parameters
(and we will show them linear in $x_i$, even though we can, as usual, choose $x_i$ to
include nonlinear functions of underlying explanatory variables). Under linearity, we
have

$$\mathrm{Quant}_\tau(y_i \mid x_i) = \alpha_o(\tau) + x_i\beta_o(\tau),$$ (12.103)

where, for reasons to be seen shortly, we explicitly introduce an intercept and
explicitly show the intercept and slopes depending on $\tau$.

To estimate the parameters in a conditional quantile function, it is very helpful
to know whether a population quantile solves a population extremum problem. We
know that the conditional mean minimizes the expected squared error and, in the
introduction, we asserted that the conditional median (when $\tau = .5$) minimizes
the expected absolute error. Generally, if $q_o(\tau)$ is the $\tau$th quantile of $y_i$, then $q_o(\tau)$
solves

$$\min_{q \in \mathbb{R}} E\{(\tau 1[y_i - q \ge 0] + (1 - \tau) 1[y_i - q < 0])\,|y_i - q|\},$$ (12.104)

where $1[\cdot]$ is the indicator function equal to one if the statement in brackets is true and
zero otherwise. The function

$$c_\tau(u) = (\tau 1[u \ge 0] + (1 - \tau) 1[u < 0])\,|u| = (\tau - 1[u < 0])\,u$$

is called the asymmetric absolute loss function, the $\tau$-absolute loss function, or the
check function (because its graph resembles a check mark) (see, for example, Manski,
1988, Sect. 4.2.4). The slope of $c_\tau(u)$ is $\tau$ when $u > 0$ and $-(1 - \tau)$ when $u < 0$; the
slope is undefined at $u = 0$. When $\tau = .5$, the check function is simply the absolute
value divided by two, and so it is symmetric about zero. If $\tau > .5$, the slope of the loss
for $u > 0$ is greater than the absolute value of the slope for $u < 0$; the opposite holds
if $\tau < .5$.
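
A small numerical check of the check function may help fix ideas. The sketch below is illustrative only: `check_loss` is a hypothetical helper name and the five data values are made up. It evaluates the average loss at each data point and picks the minimizer, relying on the fact that the average check loss is piecewise linear in $q$ with kinks only at the observations.

```python
import numpy as np

def check_loss(u, tau):
    """c_tau(u) = (tau - 1[u < 0]) * u, the asymmetric absolute loss."""
    u = np.asarray(u, dtype=float)
    return (tau - (u < 0)) * u

y = np.array([2.0, 7.0, 1.0, 9.0, 4.0])   # made-up sample
tau = 0.25

# The objective is piecewise linear in q, so a minimizer exists among the data points.
avg_loss = [check_loss(y - q, tau).mean() for q in y]
q_hat = y[int(np.argmin(avg_loss))]        # sample .25 quantile
```

For this sample the minimizer is 2, which is also the only value satisfying the two inequalities that define a .25 quantile of the empirical distribution.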

It follows immediately that a conditional quantile minimizes the asymmetric absolute
loss function conditional on $x_i$. (Of course, when $\tau = .5$, we are back to the fact
that the median minimizes the absolute error.) Therefore, we can immediately apply
the analogy principle to obtain consistent estimators of the parameters in equation
(12.103):

$$\min_{\alpha \in \mathbb{R},\ \beta \in \mathbb{R}^K} \sum_{i=1}^N c_\tau(y_i - \alpha - x_i\beta).$$ (12.105)

Under the assumption that $\theta_o(\tau) = (\alpha_o(\tau), \beta_o(\tau)')'$ is the unique minimizer of
$E[c_\tau(y_i - \alpha - x_i\beta)]$ (it is guaranteed to be a solution), the quantile regression
estimator is consistent under very weak regularity conditions. Note that $c_\tau(y_i - \alpha - x_i\beta)$
is continuous in the parameters because the check function is continuous. However,
the check function is not differentiable at zero.
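
Because the objective in (12.105) is piecewise linear, it can be minimized by linear programming. The sketch below shows one standard reformulation, not necessarily what any particular package does: each residual is split into nonnegative positive and negative parts, and SciPy's `linprog` solves the resulting program. The function name `quantile_regression` is hypothetical, and the dense constraint matrix makes this suitable only for small samples.

```python
import numpy as np
from scipy.optimize import linprog

def quantile_regression(y, X, tau):
    """Minimize sum_i c_tau(y_i - X_i b) over b by linear programming.

    X must already contain a column of ones for the intercept.
    Residuals are written as u = u_plus - u_minus with both parts >= 0,
    so the objective becomes tau * sum(u_plus) + (1 - tau) * sum(u_minus).
    """
    N, P = X.shape
    cost = np.concatenate([np.zeros(P), tau * np.ones(N), (1 - tau) * np.ones(N)])
    A_eq = np.hstack([X, np.eye(N), -np.eye(N)])          # X b + u_plus - u_minus = y
    bounds = [(None, None)] * P + [(0, None)] * (2 * N)   # b is free, parts are >= 0
    res = linprog(cost, A_eq=A_eq, b_eq=y, bounds=bounds, method="highs")
    return res.x[:P]
```

With an intercept-only design the solution reduces to a sample quantile, which gives a simple way to sanity-check the routine.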

Before we discuss estimation across different quantiles, it is useful to spend some 
time on the leading case of the conditional median. In many applications, the LAD 
estimator is applied along with OLS, often to supposedly demonstrate the sensitivity 
of OLS to influential observations. It is no secret that OLS, because it minimizes the 
sum of squared residuals, can be sensitive to the inclusion of extreme observations. In 
the context of specific models of data contamination, one can make precise the
notion that OLS is “nonrobust” to influential observations, or “outliers.” By contrast,
the LAD estimator (and quantile estimators more generally) is “robust” to
influential observations. We need not develop a formal framework for defining robustness to
outlying data—as in, for example, Huber (1981)—to understand the main point: 
OLS is sensitive to changes in extreme data points because the mean is sensitive to 
changes in extreme values; LAD is insensitive to changes in extreme data points 
because the median is insensitive to changes in extreme values. This point is easy to 
illustrate by selecting three positive integers, computing the mean and median, mul- 
tiplying the largest value by 10, and computing the mean and median again. The 
mean can increase dramatically, while the median will not change. 
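
The three-integer illustration takes only a few lines of Python. The particular values below are an arbitrary choice made for this sketch.

```python
from statistics import mean, median

values = [2, 4, 6]                 # three positive integers (arbitrary choice)
contaminated = sorted(values)
contaminated[-1] *= 10             # multiply the largest value by 10

mean_before, median_before = mean(values), median(values)
mean_after, median_after = mean(contaminated), median(contaminated)
```

Here the mean moves from 4 to 22 while the median stays at 4.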

The insensitivity of the median to changes in extreme values is desirable, but we 
should not overlook an important point: sometimes, probably more often than not, 
we are interested in partial effects on the conditional mean. If that is the case, then we 
must recognize that LAD does not generally consistently estimate parameters in a 
correctly specified conditional mean. Only least squares does (assuming we rule out 
error distributions with very thick tails). Therefore, one must be very careful in 
attributing differences between LAD and OLS to outliers; there are other reasons the 
estimates may differ significantly. If we define robustness to mean consistently esti- 
mating the parameters of the conditional mean, LAD is not a robust estimator 
of conditional mean parameters because consistency holds only under additional 
restrictions on the conditional distribution. (Other so-called robust estimators— 
where “robust” means insensitivity to outlying observations—are not robust for 
estimating the conditional mean in that they also rely on symmetry for consistency. 
See Huber (1981) for a general treatment and also Peracchi (2001).) 

To study the assumptions under which LAD and OLS estimate the same parameters,
it is helpful to write a model for a random draw $i$ as

$$y_i = \alpha_o + x_i\beta_o + u_i.$$ (12.106)

If we assume

$$D(u_i \mid x_i) \text{ is symmetric about zero},$$ (12.107)

then $E(u_i \mid x_i) = \mathrm{Med}(u_i \mid x_i) = 0$, which means that the deterministic part of (12.106),
$\alpha_o + x_i\beta_o$, is both the conditional mean and conditional median of $y_i$ (and $D(y_i \mid x_i)$
is symmetric about $\alpha_o + x_i\beta_o$). In discussions of the sensitivity of OLS to outliers, and
the superiority of LAD under such circumstances, conditional symmetry is often 
maintained, if only implicitly. If we maintain conditional symmetry, then the analysis 
is fairly clean, as the trade-offs between LAD and OLS are readily obtained. Under 
conditional symmetry, LAD can have a smaller asymptotic variance—see the next 
section for derivation—than OLS for fat-tailed distributions. But, as is well known, 
OLS will be more efficient for estimating the mean (median) parameters under certain 
thin-tailed distributions, such as the normal. 

Are there other assumptions under which OLS and LAD are both consistent for
the parameters in (12.106)? Yes. Suppose that, instead of (12.107), we assume
independence and take as the normalization $E(u_i) = 0$:

$$D(u_i \mid x_i) = D(u_i) \text{ and } E(u_i) = 0.$$ (12.108)

Under (12.108), OLS consistently estimates $\alpha_o$ and $\beta_o$. Notice that $u_i$ need not have
a symmetric distribution. When it does not, we generally have $\mathrm{Med}(u_i) \equiv \eta_o \neq 0$.
Nevertheless, by independence,

$$\mathrm{Med}(y_i \mid x_i) = \alpha_o + x_i\beta_o + \mathrm{Med}(u_i) = (\alpha_o + \eta_o) + x_i\beta_o.$$

This equation immediately implies that the LAD slope estimators are consistent for
$\beta_o$. Therefore, OLS and LAD should provide similar estimates of the slope parameters.
But they will estimate different intercepts.
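
A short simulation can illustrate the independence case. The design below is an assumption made purely for this sketch: a single binary regressor, so that the linear model is saturated and OLS reproduces group means while LAD reproduces group medians, and a mean-zero but skewed error (a centered exponential) drawn independently of the regressor.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 20_000
alpha_o, beta_o = 1.0, 2.0

x = rng.integers(0, 2, size=n)              # binary regressor
u = rng.exponential(1.0, size=n) - 1.0      # E(u) = 0, Med(u) = log(2) - 1, independent of x
y = alpha_o + beta_o * x + u

# Saturated model: OLS fits group means, LAD fits group medians.
ols_intercept = y[x == 0].mean()
lad_intercept = np.median(y[x == 0])
ols_slope = y[x == 1].mean() - ols_intercept
lad_slope = np.median(y[x == 1]) - lad_intercept
```

Both slope estimates should be close to 2, while the LAD intercept should be close to $\alpha_o + \eta_o = 1 + \log(2) - 1 \approx .69$ rather than 1.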

Unfortunately, in many applications of LAD, the independence assumption is
clearly violated. Heteroskedasticity in $\mathrm{Var}(u_i \mid x_i)$ is common when $y_i$ is a variable
such as wealth, income, or pension contributions. In addition, conditional wealth and
income distributions tend to be skewed. Therefore, when LAD methods are applied
alongside OLS, there are often reasons to think a priori that OLS and LAD will not
produce similar slope estimates. (In fact, it is unlikely that the conditional mean and
conditional median are both linear in $x_i$.) But important differences in the OLS
and LAD estimates need have nothing to do with the presence of outliers.

Sometimes one can use a transformation to ensure conditional symmetry or the
independence assumption in (12.108). When $y_i > 0$, the most common transformation
is the natural log. Often, the linear model $\log(y_i) = \alpha_o + x_i\beta_o + u_i$ is more likely
to satisfy symmetry or independence. Suppose that symmetry about zero holds in the
linear model for $\log(y_i)$. Then, because the median passes through monotonic functions
(unlike the expectation), $\mathrm{Med}(y_i \mid x_i) = \exp(\mathrm{Med}[\log(y_i) \mid x_i]) = \exp(\alpha_o + x_i\beta_o)$,
and so we can easily recover the partial effects on the median of $y_i$ itself. By contrast,
we cannot generally find $E(y_i \mid x_i) = \exp(\alpha_o + x_i\beta_o)E[\exp(u_i) \mid x_i]$. If instead we
assume (12.108), then $\mathrm{Med}(y_i \mid x_i)$ and $E(y_i \mid x_i)$ are both exponential functions of $x_i\beta_o$,
but with different “intercepts” inside the exponential function.
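
The contrast between the median and the mean under a monotonic transformation is easy to see in a sample. The normal draws below are an arbitrary choice for this sketch, and an odd sample size is used so that the sample median is itself a data point.

```python
import numpy as np

rng = np.random.default_rng(1)
log_y = rng.normal(loc=0.5, scale=1.0, size=10_001)   # odd n
y = np.exp(log_y)

# The median commutes with exp(); the mean does not.
median_gap = np.median(y) - np.exp(np.median(log_y))  # zero up to rounding
mean_gap = np.mean(y) - np.exp(np.mean(log_y))        # positive by Jensen's inequality
```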

Incidentally, the fact that the median passes through monotonic functions is very
handy for applying LAD to a variety of problems, including some in Chapters 15 and
17. But the expectation operator has useful properties that the median does not:
linearity and the law of iterated expectations. For example, suppose we begin with a
random coefficient model $y_i = a_i + x_i b_i$, where $a_i$ is the heterogeneous intercept and $b_i$
is a $K \times 1$ vector of heterogeneous slopes (“random coefficients”). If we assume that
$(a_i, b_i)$ is independent of $x_i$, then

$$E(y_i \mid x_i) = E(a_i \mid x_i) + x_i E(b_i \mid x_i) = \alpha_o + x_i\beta_o,$$

where $\alpha_o = E(a_i)$ and $\beta_o = E(b_i)$. Because OLS consistently estimates the parameters
of a conditional mean linear in those parameters, OLS consistently estimates the
population averaged effects, or average partial effects, $\beta_o$ (see also Section 4.4.4).
Generally, even under independence, there is no way to derive $\mathrm{Med}(y_i \mid x_i)$, and it
cannot be shown that LAD estimation of a linear model estimates average or median
partial effects. Angrist, Chernozhukov, and Fernandez-Val (2006) provide a treatment
of LAD (and quantile regression) under misspecification and characterize the
probability limit of the LAD estimator.

We now turn to general quantile estimation. In many applications, one estimates
linear conditional quantile functions for various quantiles. Having a set of estimated
linear quantiles allows one to see how various explanatory variables differentially
affect different parts of the distribution $D(y_i \mid x_i)$. Of course, nothing
guarantees that all or even several of the conditional quantiles are actually linear.
With additive errors independent of the regressors, we have equation (12.103) but
with common slopes, $\beta_o(\tau) = \beta_o$ for all $\tau$. Then, we can estimate the common slopes
using any quantile with $0 < \tau < 1$. In most applications of quantile regression, the
whole point is to see how the effects of the covariates change with the quantile, and so
different linear models are estimated for different quantiles. Practically speaking, the
conditions under which the quantile functions do not cross are quite restrictive, and
the estimated quantiles often do not show uniformly increasing or decreasing slopes
as $\tau$ ranges between zero and one. Koenker (2005) provides further discussion.


12.10.2 Asymptotic Inference


As mentioned in Section 12.3, LAD estimation falls outside Theorem 12.3 because
the objective function is not twice continuously differentiable with positive definite
expected Hessian (at $\theta_o$). For general quantile regression, the nondifferentiability in
the check function occurs only at zero. To simplify notation, for a given $\tau$ we now
write

$$y_i = x_i\theta_o + u_i, \quad \mathrm{Quant}_\tau(u_i \mid x_i) = 0,$$ (12.109)

where the first element of $x_i$ is unity. Provided there is sufficient variation in the
distribution of $x_i$, so that the probability that $y_i - x_i\theta$ is zero is sufficiently small, the
nonsmoothness of $c_\tau(u)$ at $u = 0$ does not cause a serious problem. The real complication
is that the second derivative of the check function is zero everywhere it is
defined, that is, for all $u \neq 0$. Interestingly, although the usual mean value expansion
of the score can no longer be applied, it is possible to modify the argument and obtain
an influence function representation for the quantile regression estimator. If we
write the objective function as

$$q(w_i, \theta) = \tau 1[y_i - x_i\theta \ge 0](y_i - x_i\theta) - (1 - \tau) 1[y_i - x_i\theta < 0](y_i - x_i\theta),$$

then we can define a score function as

$$s_i(\theta) = -x_i'\{\tau 1[y_i - x_i\theta \ge 0] - (1 - \tau) 1[y_i - x_i\theta < 0]\}.$$ (12.110)

We refer to $s_i(\theta)$ as a score function because, for $\theta$ such that $y_i - x_i\theta \neq 0$, $s_i(\theta)$ is
in fact the transpose of the gradient of the objective function, just as before. The
hope is that we do not have to worry about what happens when the objective
function is nondifferentiable. If $u_i$ has a continuous distribution at zero, then
$P(y_i - x_i\theta_o = 0) = 0$. In other words, at $\theta = \theta_o$, we can ignore the possibility of
observing data where the objective function is nondifferentiable at the true value.
Because $\hat{\theta}$ is consistent for $\theta_o$, as the sample size grows there is less and less chance that
$q(w_i, \theta)$ is nondifferentiable at $\hat{\theta}$, and so we can use a first-order condition to obtain $\hat{\theta}$.

We can show directly that the score satisfies $E[s_i(\theta_o) \mid x_i] = 0$ because
$E(1[y_i - x_i\theta_o \ge 0] \mid x_i) = P(y_i \ge x_i\theta_o \mid x_i) = 1 - \tau$ by definition of a quantile.
Further, $E(1[y_i - x_i\theta_o < 0] \mid x_i) = P(y_i < x_i\theta_o \mid x_i) = \tau$ (when we assume $u_i$ has a
continuous distribution at zero), and so

$$E[s_i(\theta_o) \mid x_i] = -x_i'[\tau(1 - \tau) - (1 - \tau)\tau] = 0.$$


The asymptotic theory requires that we consider solutions that satisfy
$N^{-1/2}\sum_{i=1}^N s_i(\hat{\theta}) = o_p(1)$. (Technically, there is a chance that the solution to the
minimization problem does not solve the first-order condition, but that chance
diminishes as the sample size grows.) The
key to obtaining an influence function representation is to first compute the expected
value of the score, then obtain the Jacobian. If we first obtain the Jacobian of
the score, then it is identically zero. Therefore, let $a(\theta) \equiv E[s_i(\theta)]$ and assume that the
$P \times P$ Jacobian of $a(\cdot)$, say $A(\cdot)$, exists. Further, assume that $A(\theta_o)$ is nonsingular; in
fact, it will almost always be positive definite. By first taking the expectation of $s_i(\theta)$,
we smooth it out, and under weak conditions the expected value of the score is well
behaved. The starting point is to find the expectation conditional on $x_i$:

$$\begin{aligned} E[s_i(\theta) \mid x_i] &= -x_i'\{\tau P[u_i \ge x_i(\theta - \theta_o) \mid x_i] - (1 - \tau) P[u_i < x_i(\theta - \theta_o) \mid x_i]\} \\ &= -x_i'\{\tau[1 - F_u(x_i(\theta - \theta_o) \mid x_i)] - (1 - \tau) F_u(x_i(\theta - \theta_o) \mid x_i)\} \\ &= -x_i'[\tau - F_u(x_i(\theta - \theta_o) \mid x_i)], \end{aligned}$$ (12.111)

where we assume that the conditional cumulative distribution function $F_u(\cdot \mid x)$ is
continuous at zero. In fact, we assume that $F_u(\cdot \mid x)$ is continuously differentiable and
denote the density $f_u(\cdot \mid x)$. Of course, $E[s_i(\theta)]$ is just the expected value of (12.111)
across the distribution of $x_i$. Assuming that we can interchange the expectation
and the Jacobian (which holds under general conditions, as described in Bartle
(1966, Chap. 4)), we can first compute the Jacobian of $E[s_i(\theta) \mid x_i]$ and then
obtain its expected value to obtain $A(\theta)$. But from (12.111),
$\nabla_\theta E[s_i(\theta) \mid x_i] = f_u(x_i(\theta - \theta_o) \mid x_i)\,x_i'x_i$, and evaluating this Jacobian at $\theta_o$
and taking the expectation gives

$$A_o \equiv A(\theta_o) = E[f_u(0 \mid x_i)\,x_i'x_i].$$ (12.112)


Using the methods of Huber (1967) and Newey and McFadden (1994, Sect. 7), one
can derive the representation

$$\sqrt{N}(\hat{\theta} - \theta_o) = -A_o^{-1} N^{-1/2} \sum_{i=1}^N s_i(\theta_o) + o_p(1).$$ (12.113)

Because $s_i(\theta_o)$ has zero mean, it follows (under mild conditions) that

$$\sqrt{N}(\hat{\theta} - \theta_o) \xrightarrow{d} \mathrm{Normal}(0, A_o^{-1} B_o A_o^{-1}),$$ (12.114)

where

$$B_o \equiv E[s_i(\theta_o) s_i(\theta_o)'] = \tau(1 - \tau) E(x_i'x_i).$$ (12.115)

(Problem 12.14 asks you to derive the expression for $B_o$.)

The matrix $B_o$ is simple to estimate (recall that we have chosen $\tau$):

$$\hat{B} = \tau(1 - \tau)\left(N^{-1} \sum_{i=1}^N x_i'x_i\right).$$ (12.116)

We can also use the average outer product of the scores, $s_i(\hat{\theta})$. This estimator is
generally consistent when $P(y_i - x_i\theta_o = 0) = 0$ by Lemma 12.1 (allowing the
function to be discontinuous at $\theta_o$ with probability zero).

The matrix $A_o$ is much more difficult to estimate because, at first glance, it seems to
require estimation of $f_u(0 \mid x_i)$, the conditional density of $u_i$ at zero. One approach is
to use a nonparametric density estimator, but these estimators can be imprecise,
especially when the dimension of $x_i$ is large. It turns out that we can estimate $A_o$ more
simply, using an approach due to Powell (1991). To sketch the approach, we can use
the assumption that $F_u(\cdot \mid x_i)$ is differentiable at zero, along with the definition of a
derivative, to approximate $f_u(0 \mid x_i)$ as

$$[F_u(h_N \mid x_i) - F_u(-h_N \mid x_i)]/2h_N = P(-h_N \le u_i \le h_N \mid x_i)/2h_N = P(|u_i| \le h_N \mid x_i)/2h_N,$$

where $\{h_N\}$ is a sequence of positive numbers with $h_N \to 0$. (We will see in a moment
why we subscript $h_N$ with the sample size, $N$.) It is convenient to write the conditional
probability as a conditional expectation using indicator functions,
$P(|u_i| \le h_N \mid x_i) = E(1[|u_i| \le h_N] \mid x_i)$. Therefore, an approximation of
$E[f_u(0 \mid x_i)\,x_i'x_i]$ for “small” $h_N$ is

$$(2h_N)^{-1} E\{E(1[|u_i| \le h_N] \mid x_i)\,x_i'x_i\} = (2h_N)^{-1} E(1[|u_i| \le h_N]\,x_i'x_i),$$ (12.117)

where the equality holds by iterated expectations. Switching from the conditional to
the unconditional expectation of $1[|u_i| \le h_N]$ is an important simplification, and is
reminiscent of the argument used to obtain the heteroskedasticity-robust variance
matrix estimator. In that case, the matrix to be estimated, in the middle of the
sandwich, is $E[E(u_i^2 \mid x_i)\,x_i'x_i] = E(u_i^2\,x_i'x_i)$. The latter expression is much easier to estimate
directly because it circumvents the need to estimate the conditional expectation
$E(u_i^2 \mid x_i)$.

Proceeding with the quantile regression case, we estimate (12.117) using the sample
analogue, as usual:

$$\hat{A} = (2h_N)^{-1} N^{-1} \sum_{i=1}^N 1[|\hat{u}_i| \le h_N]\,x_i'x_i = (2Nh_N)^{-1} \sum_{i=1}^N 1[|\hat{u}_i| \le h_N]\,x_i'x_i,$$ (12.118)

where $\hat{u}_i = y_i - x_i\hat{\theta}$ are the residuals from the quantile regression. Unfortunately,
unlike in previous cases, we cannot so easily assert that $\hat{A}$ is consistent for $A_o$: we
have said nothing about how quickly $h_N$ decreases to zero with the sample size, and
the indicator function is not continuous. Powell (1991) (see also Koenker [2005])
shows that, under weak conditions, $h_N \to 0$ together with $\sqrt{N}h_N \to \infty$ is sufficient for
consistency. The second condition controls how quickly $h_N$ shrinks to zero. For
example, $h_N = aN^{-1/3}$ for any $a > 0$ satisfies these conditions. The practical problem is
choosing $a$ (or choosing $h_N$ more generally). Koenker (2005) contains specific
recommendations, and also discusses related estimators. In equation (12.118),
observation $i$ does not contribute if $|\hat{u}_i| > h_N$. Other methods allow each observation to enter
the sum but with a weight that declines as $|\hat{u}_i|$ increases. In practice, one uses the most
convenient estimate (that may be programmed into an existing econometrics package
that performs quantile regression). The nonparametric bootstrap can be applied to
quantile regression, but if the data set is large, the computation using several hundred
bootstrap samples can be costly.
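
The sandwich estimator built from (12.116) and (12.118) takes only a few lines of NumPy. The sketch below is illustrative: the function name `powell_avar` is hypothetical, the bandwidth rule $h_N = aN^{-1/3}$ is the example rate mentioned above, and the default `a = 1.0` is a placeholder rather than a recommendation.

```python
import numpy as np

def powell_avar(X, resid, tau, a=1.0):
    """Estimate Avar(theta_hat) as A_hat^{-1} B_hat A_hat^{-1} / N.

    X is N x P (including the constant) and resid holds the quantile
    regression residuals. The bandwidth is h_N = a * N**(-1/3); the
    constant `a` is a tuning choice assumed for this sketch.
    """
    N = X.shape[0]
    h = a * N ** (-1.0 / 3.0)
    inside = (np.abs(resid) <= h).astype(float)           # 1[|u_hat_i| <= h_N]
    A_hat = (X * inside[:, None]).T @ X / (2.0 * N * h)   # equation (12.118)
    B_hat = tau * (1.0 - tau) * (X.T @ X) / N             # equation (12.116)
    A_inv = np.linalg.inv(A_hat)
    return A_inv @ B_hat @ A_inv / N
```

Standard errors are the square roots of the diagonal of the returned matrix. If very few residuals fall inside the band, `A_hat` can be poorly conditioned, which is one practical reason the choice of $h_N$ matters.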

If we assume that $u_i$ is independent of $x_i$, then $f_u(0 \mid x_i) = f_u(0)$ and the asymptotic
variance in equation (12.114) simplifies to

$$\frac{\tau(1 - \tau)}{[f_u(0)]^2}\,[E(x_i'x_i)]^{-1};$$ (12.119)

its estimator has the general form

$$\frac{\tau(1 - \tau)}{[\hat{f}_u(0)]^2}\left(N^{-1} \sum_{i=1}^N x_i'x_i\right)^{-1},$$ (12.120)

and a simple, consistent estimate of $f_u(0)$ is

$$\hat{f}_u(0) = (2Nh_N)^{-1} \sum_{i=1}^N 1[|\hat{u}_i| \le h_N],$$ (12.121)

where $h_N$ satisfies the same conditions as before. The estimator in (12.121) is easily
seen to be a simple histogram estimator of the density of $u_i$ at $u = 0$, where the bin
width is $2h_N$ (and we must use the residuals in place of $u_i$). Alternatively, one can use
a different kernel density estimator applied to $\{\hat{u}_i\}$; see, for example, Cameron and
Trivedi (2005, Sect. 9.3).
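
Under independence, the calculation in (12.120) and (12.121) collapses to a scalar density estimate times the inverse of the regressor second-moment matrix. A minimal sketch follows, again with a hypothetical function name (`iid_avar`) and a placeholder bandwidth constant; the result is divided by $N$ so that it estimates the variance of $\hat{\theta}$ rather than of $\sqrt{N}(\hat{\theta} - \theta_o)$.

```python
import numpy as np

def iid_avar(X, resid, tau, a=1.0):
    """Avar(theta_hat) assuming u_i is independent of x_i."""
    N = X.shape[0]
    h = a * N ** (-1.0 / 3.0)                         # assumed bandwidth rule
    f_hat = np.mean(np.abs(resid) <= h) / (2.0 * h)   # histogram estimate (12.121)
    scale = tau * (1.0 - tau) / f_hat**2
    return scale * np.linalg.inv(X.T @ X / N) / N     # (12.120) divided by N
```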


Example 12.1 (Quantile Regression for Financial Wealth): We use the data set on
single individuals (fsize = 1) in the data file 401KSUBS.RAW (from Abadie (2003))
to estimate conditional quantiles for net total financial wealth (nettfa). The explanatory
variables are income, age, and a binary variable indicating eligibility to participate
in a 401(k) pension plan through one’s employer. The estimates, including OLS
estimates of a linear model for the conditional mean, are given in Table 12.1.
Financial wealth and income are both measured in thousands of dollars.

Table 12.1
Mean and Quantile Regression for Net Total Financial Wealth
Dependent Variable: nettfa

Explanatory   (1)          (2)            (3)            (4)            (5)            (6)
Variable      Mean (OLS)   .10 Quantile   .25 Quantile   Median (LAD)   .75 Quantile   .90 Quantile
inc           .783         -.0179         .0713          .324           .798           1.291
              (.104)       (.0177)        (.0072)        (.012)         (.025)         (.048)
age           -1.568       -.0663         .0336          -.244          -1.386         -3.579
              (1.076)      (.2307)        (.0955)        (.146)         (.287)         (.501)
age²          .0284        .0024          .0004          .0048          .0242          .0605
              (.0138)      (.0027)        (.0011)        (.0017)        (.0034)        (.0059)
e401k         6.837        .949           1.281          2.598          4.460          6.001
              (2.173)      (.617)         (.263)         (.404)         (.801)         (1.437)
N             2,017        2,017          2,017          2,017          2,017          2,017

The OLS standard errors (in parentheses) are robust to heteroskedasticity.
The quantile regression estimates, along with standard errors (in parentheses), were obtained using Stata
9.0. The variance matrix is of the form in equation (12.120); that is, it assumes independence between the
error and regressors.

There are no surprises in the OLS estimates. The mean relationship between
financial wealth and income is strong and very statistically significant. Further,
eligibility in a 401(k) plan, holding income and age fixed, is estimated to increase
expected financial wealth by about $6,800. The heteroskedasticity-robust t statistic is
over three. The coefficients on age and age² may appear puzzling at first, but they
imply an increasing effect on nettfa starting at age = 27.6. This makes sense because
the youngest person in the sample is 25. (In fact, the fit of the model is hardly
changed, with the R-squared going from .1273 to .1272, if we replace age and age² with
the single explanatory variable (age - 25)², and the restriction certainly cannot be
rejected.)
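
The turning point quoted for age follows from setting the derivative of the quadratic to zero. The arithmetic, using the OLS coefficients in column (1) of Table 12.1, is:

```python
b_age, b_age_sq = -1.568, 0.0284      # OLS estimates, column (1) of Table 12.1

# d nettfa / d age = b_age + 2 * b_age_sq * age, which equals zero at:
turning_point = -b_age / (2 * b_age_sq)
```

This gives roughly 27.6, so the estimated age effect is increasing over nearly the entire sample range.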

The picture of the effects of income and 401(k) eligibility is very different when we
look across the wealth distribution. How do we interpret the coefficient on inc for the
quantile regressions? Consider the median regression result. Holding age and e401k
fixed, the coefficient on inc implies that if we compare two groups of people whose
income differs by $1,000, the median financial wealth is estimated to be $324 higher
for the group with higher income. The effect on the median is less than half of the
effect on the mean ($783). The effect of income at the low end of the nettfa distribution,
the .10 quantile, is essentially zero: the estimate is small and statistically
insignificant. Income has a large effect at the upper end of the
financial wealth distribution. For example, increasing income by $1,000
increases the .90 quantile of nettfa by $1,291, or more than $1,000. The coefficient on
inc is statistically greater than one.

The effects of 401(k) eligibility on nettfa also increase as we move up the wealth 
distribution, even conditional on income. Wealthy people can afford to contribute the 
maximums allowable by law, and so the option of contributing to tax-deferred sav- 
ings plans, such as 401(k) plans, leads to a larger effect as we move up the distribu- 
tion. (The coefficient on e401k increases to about 9.7 at the .95 quantile.) 


12.10.3 Quantile Regression for Panel Data 


Quantile regression methods can be applied to panel data, too. For a given quantile
$0 < \tau < 1$, suppose we specify

$$\mathrm{Quant}_\tau(y_{it} \mid x_{it}) = x_{it}\theta_o, \quad t = 1, \ldots, T,$$ (12.122)

where $x_{it}$ probably allows for a full set of time period intercepts. Of course, we can
write $y_{it} = x_{it}\theta_o + u_{it}$, where $\mathrm{Quant}_\tau(u_{it} \mid x_{it}) = 0$. The natural estimator of $\theta_o$ is the
pooled quantile regression estimator, $\hat{\theta}$, which solves

$$\min_{\theta \in \Theta} \sum_{i=1}^N \sum_{t=1}^T c_\tau(y_{it} - x_{it}\theta),$$ (12.123)

where $c_\tau(\cdot)$ is the check function. Now, when we define a score $s_i(\theta) = \sum_{t=1}^T s_{it}(\theta)$, we
generally have to account for serial correlation in $s_{it}(\theta_o)$ (although see Problem 12.15
for the case of dynamically complete quantiles). One technical issue in using the outer
product of the score to estimate $B_o$, that is,

$$\hat{B} = N^{-1} \sum_{i=1}^N s_i(\hat{\theta}) s_i(\hat{\theta})' = N^{-1} \sum_{i=1}^N \sum_{t=1}^T \sum_{r=1}^T s_{it}(\hat{\theta}) s_{ir}(\hat{\theta})',$$ (12.124)

is that the score function is discontinuous for $\theta$ such that $y_{it} = x_{it}\theta$ for some
$t \in \{1, \ldots, T\}$. Nevertheless, if we assume $P(u_{i1} \neq 0, \ldots, u_{iT} \neq 0) = 1$, then $s_i(\theta)$ is
continuous at $\theta_o$ with probability one, and this is enough to apply a slightly
generalized version of Lemma 12.1. Because $\hat{B}$ includes the terms $s_{it}(\hat{\theta}) s_{ir}(\hat{\theta})'$ for $t \neq r$,
$\hat{B}$ accounts for any kind of neglected dynamics in $\mathrm{Quant}_\tau(y_{it} \mid x_{it})$. The terms
$s_{it}(\hat{\theta}) s_{it}(\hat{\theta})'$ could be simplified along the lines of (12.116), but it is unnecessary and
seems pointless when the terms for $t \neq r$ are included.

Estimation of $A_o$ is similar to the cross section case. A fully robust estimator, one
that does not assume independence between $u_{it}$ and $x_{it}$ and allows the distribution of
$u_{it}$ to change across $t$, is an extension of equation (12.118):

$$\hat{A} = (2Nh_N)^{-1} \sum_{i=1}^N \sum_{t=1}^T 1[|\hat{u}_{it}| \le h_N]\,x_{it}'x_{it},$$ (12.125)

or we can replace the indicator function with a smoothed version. Rather than using
$\hat{A}^{-1}\hat{B}\hat{A}^{-1}/N$ as the estimate of $\mathrm{Avar}(\hat{\theta})$, the bootstrap can be applied using the
method for panel data described in Section 12.8.2.
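
For a balanced panel, the fully robust variance estimate based on (12.124) and (12.125) can be sketched as follows. The array layout (units by periods by regressors), the function name `pooled_quantile_avar`, and the bandwidth constant are assumptions made for the illustration; the score uses the simplification $s_{it} = -x_{it}'(\tau - 1[u_{it} < 0])$, which agrees with the definition in (12.110).

```python
import numpy as np

def pooled_quantile_avar(X, resid, tau, a=1.0):
    """Fully robust Avar(theta_hat) for pooled quantile regression.

    X has shape (N, T, P) and resid has shape (N, T). Each unit's scores
    are summed over t before the outer product is taken, as in (12.124),
    so arbitrary serial correlation in the scores is allowed.
    """
    N, T, P = X.shape
    h = a * N ** (-1.0 / 3.0)                              # assumed bandwidth rule
    weights = tau - (resid < 0)                            # tau - 1[u_it < 0]
    s_i = -(X * weights[:, :, None]).sum(axis=1)           # N x P summed scores
    B_hat = s_i.T @ s_i / N
    inside = (np.abs(resid) <= h).astype(float)            # 1[|u_hat_it| <= h_N]
    A_hat = np.einsum("it,itp,itq->pq", inside, X, X) / (2.0 * N * h)
    A_inv = np.linalg.inv(A_hat)
    return A_inv @ B_hat @ A_inv / N
```

Summing the scores within each unit before forming the outer product is what distinguishes this from treating the NT observations as independent draws.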

Allowing explicitly for unobserved effects in quantile regression is trickier. For a
given quantile $0 < \tau < 1$, a natural specification that incorporates strict exogeneity
conditional on $c_i$ is

$$\mathrm{Quant}_\tau(y_{it} \mid x_i, c_i) = \mathrm{Quant}_\tau(y_{it} \mid x_{it}, c_i) = x_{it}\theta_o + c_i, \quad t = 1, \ldots, T,$$ (12.126)

which is reminiscent of the way we specified the conditional mean in Chapter 10.
Equivalently, we can write

$$y_{it} = x_{it}\theta_o + c_i + u_{it}, \quad \mathrm{Quant}_\tau(u_{it} \mid x_i, c_i) = 0, \quad t = 1, \ldots, T.$$

Unfortunately, unlike in the case of estimating effects on the conditional mean, we
cannot proceed without further assumptions. A “fixed effects” approach, where we
allow $D(c_i \mid x_i)$ to be unrestricted, is attractive, but generally there are no simple
transformations to eliminate $c_i$ and estimate $\theta_o$. If we treat the $c_i$ as parameters to estimate
along with $\theta_o$, the resulting estimator generally suffers from an incidental parameters
problem, a topic that comes up in Chapter 13 and at several places in Part IV.
Briefly, if we try to estimate $c_i$ for each $i$, then, with large $N$ and small $T$, the poor
quality of the estimates of $c_i$ causes the accompanying estimate of $\theta_o$ to be badly
behaved. Recall that this was not the case when we used the FE estimator for a
conditional mean: treating the $c_i$ as parameters led us to the within estimator. Koenker
(2004) derives asymptotic properties of this estimation procedure when $T$ grows
along with $N$, but he adds the assumptions that the regressors are fixed and
$\{u_{it}: t = 1, \ldots, T\}$ is serially independent.

An alternative approach is suggested by Abrevaya and Dahl (2008) for $T = 2$.
Motivated by Chamberlain’s approach to linear unobserved effects models (see
Section 11.1.2), Abrevaya and Dahl estimate separate quantile regressions
$\mathrm{Quant}_\tau(y_{it} \mid x_{i1}, x_{i2})$ (with intercepts, of course) for $t = 1, 2$. They define the partial
effects in a way that mimics a representation of the partial effects in Chamberlain’s
correlated random effects (CRE) approach.

For quantile regression, CRE approaches are generically hampered because finding
quantiles of sums of random variables is difficult. For example, suppose we
impose the Mundlak representation $c_i = \psi_o + \bar{x}_i\xi_o + a_i$. Then we can write
$y_{it} = \psi_o + x_{it}\theta_o + \bar{x}_i\xi_o + a_i + u_{it} \equiv \psi_o + x_{it}\theta_o + \bar{x}_i\xi_o + v_{it}$, where $v_{it}$ is the composite
error. Now, if we assume $v_{it}$ is independent of $x_i$, then we can estimate $\theta_o$ and $\xi_o$
using pooled quantile regression of $y_{it}$ on 1, $x_{it}$, and $\bar{x}_i$. (The intercept does not
estimate a quantity of particular interest.) But independence is very strong, and, if we
truly believe it, then we probably believe all quantile functions are parallel. Of course,
we can always just assert that the effect of interest is the set of coefficients on $x_{it}$ in the
pooled quantile estimation, and we allow these, along with the intercept and
coefficients on $\bar{x}_i$, to change across quantiles. The asymptotic variance matrix estimator
discussed for pooled quantile regression applies directly once we define the
explanatory variables at time $t$ to be $(1, x_{it}, \bar{x}_i)$.

We have more flexibility if we are interested in the median, and a few simple
approaches suggest themselves. Write the model
$\mathrm{Med}(y_{it} \mid \mathbf{x}_i, c_i) = \mathrm{Med}(y_{it} \mid \mathbf{x}_{it}, c_i) = \mathbf{x}_{it}\boldsymbol{\theta}_o + c_i$ in error form as

$$y_{it} = \mathbf{x}_{it}\boldsymbol{\theta}_o + c_i + u_{it}, \quad \mathrm{Med}(u_{it} \mid \mathbf{x}_i, c_i) = 0, \quad t = 1, \ldots, T,$$

and consider the multivariate conditional distribution $D(\mathbf{u}_i \mid \mathbf{x}_i)$. If this
distribution is symmetric about zero in the sense that
$D(\mathbf{u}_i \mid \mathbf{x}_i) = D(-\mathbf{u}_i \mid \mathbf{x}_i)$ (which is sometimes called centrally symmetric),
then the distribution of $\mathbf{g}'\mathbf{u}_i$ given $\mathbf{x}_i$ is symmetric about zero for any linear
combination $\mathbf{g}$ (see, for example, Serfling (2006) for discussion). In particular,
the time-demeaned errors $\ddot{u}_{it}$ have (univariate) conditional distributions
symmetric about zero, which means we can consistently estimate $\boldsymbol{\theta}_o$ by applying
pooled least absolute deviations to the time-demeaned equation
$\ddot{y}_{it} = \ddot{\mathbf{x}}_{it}\boldsymbol{\theta}_o + \ddot{u}_{it}$, being sure to obtain fully robust standard errors
by using equations (12.124) and (12.125) on the time-demeaned data.
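
As a rough illustration of pooled LAD on time-demeaned data, the following Python sketch handles only the special case of a single scalar regressor. In that case minimizing $\sum |\ddot{y}_{it} - \ddot{x}_{it} b|$ reduces to a weighted median of the ratios $\ddot{y}_{it}/\ddot{x}_{it}$ with weights $|\ddot{x}_{it}|$. The function names and the list-of-lists data layout are assumptions made for this example; they do not come from the text, and the fully robust standard errors based on (12.124) and (12.125) are not computed here.

```python
def time_demean(panel):
    """Subtract unit-specific time averages.

    `panel` is a list of units; each unit is a list of (x, y) pairs, one per
    time period (hypothetical layout chosen for this sketch).
    Returns one pooled list of demeaned (x, y) pairs.
    """
    demeaned = []
    for unit in panel:
        T = len(unit)
        x_bar = sum(x for x, _ in unit) / T
        y_bar = sum(y for _, y in unit) / T
        demeaned.extend((x - x_bar, y - y_bar) for x, y in unit)
    return demeaned


def pooled_lad_slope(pairs):
    """Minimize sum |y - x*b| over b for a scalar regressor, no intercept.

    For x != 0, |y - x*b| = |x| * |y/x - b|, so a minimizer is a weighted
    median of the ratios y/x with weights |x|. Pairs with x == 0 do not
    depend on b and are dropped.
    """
    ratios = sorted((y / x, abs(x)) for x, y in pairs if x != 0.0)
    if not ratios:
        raise ValueError("no nonzero regressor values")
    half = sum(w for _, w in ratios) / 2.0
    cumulative = 0.0
    for ratio, weight in ratios:
        cumulative += weight
        if cumulative >= half:
            return ratio


# Two units with different intercepts c_i and a common slope of 2.
panel = [
    [(1.0, 7.0), (3.0, 11.0)],               # y = 2x + 5
    [(0.0, -7.0), (4.0, 1.0), (2.0, -3.0)],  # y = 2x - 7
]
theta_hat = pooled_lad_slope(time_demean(panel))
```

With a vector of regressors the objective is no longer a weighted median problem, and a linear programming or quantile regression routine would be needed instead.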

Alternatively, under the centrally symmetric assumption, the differences in the
errors, $\Delta u_{it} = u_{it} - u_{i,t-1}$, have symmetric distributions about zero, so one can
apply pooled LAD to $\Delta y_{it} = \Delta\mathbf{x}_{it}\boldsymbol{\theta}_o + \Delta u_{it}$, $t = 2, \ldots, T$. From Honoré
(1992) applied to the uncensored case, LAD on the first differences is consistent
when $\{u_{it} : t = 1, \ldots, T\}$ is an i.i.d. sequence conditional on $(\mathbf{x}_i, c_i)$, even if
the common distribution is not symmetric, and this may afford robustness for LAD on
the first differences rather than on the time-demeaned data. (Interestingly, it
follows from the discussion in Honoré (1992, Appendix 1) that when $T = 2$, applying
LAD on the first differences is equivalent to estimating the $c_i$ along with
$\boldsymbol{\theta}_o$. So, in this case, there is no incidental parameters problem in estimating
the $c_i$ as long as $u_{i2} - u_{i1}$ has a symmetric distribution.) Although not an
especially weak assumption, central symmetry of $D(\mathbf{u}_i \mid \mathbf{x}_i)$ allows for serial
dependence and heteroskedasticity in the $u_{it}$ (both of which can depend on
$\mathbf{x}_i$ or on $t$). As always, we should be cautious in comparing the pooled OLS and
pooled LAD estimates of $\boldsymbol{\theta}_o$ on the demeaned or differenced data because they
are only expected to be similar under the conditional symmetry assumption.

If we impose the Mundlak device, we can get by with conditional symmetry of a
sequence of bivariate distributions. Write
$y_{it} = \psi_o + \mathbf{x}_{it}\boldsymbol{\theta}_o + \bar{\mathbf{x}}_i\boldsymbol{\xi}_o + a_i + u_{it}$, where
$\mathrm{Med}(u_{it} \mid \mathbf{x}_i, a_i) = 0$. If $D(a_i, u_{it} \mid \mathbf{x}_i)$ is symmetric around zero, then
$D(a_i + u_{it} \mid \mathbf{x}_i)$ is symmetric about zero, and, if this holds for each $t$, pooled
LAD of $y_{it}$ on $1$, $\mathbf{x}_{it}$, and $\bar{\mathbf{x}}_i$ consistently estimates
$(\psi_o, \boldsymbol{\theta}_o, \boldsymbol{\xi}_o)$. (Therefore, we can estimate the partial effects on
$\mathrm{Med}(y_{it} \mid \mathbf{x}_{it}, c_i)$ and also test if $c_i$ is correlated with $\bar{\mathbf{x}}_i$.) The
assumptions used for this approach are not as weak as we would like, but, as in using
pooled LAD on the time-demeaned data, adding $\bar{\mathbf{x}}_i$ to pooled LAD gives a way to
compare with the usual FE estimate of $\boldsymbol{\theta}_o$. (Remember, if we use pooled OLS
with $\bar{\mathbf{x}}_i$ included, we obtain the FE estimate.) Fully robust inference can be
obtained by computing $\hat{\mathbf{B}}$ and $\hat{\mathbf{A}}$ in (12.124) and (12.125), respectively.


Problems 


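12.1. a. Use equation (12.4) to show that $\boldsymbol{\theta}_o$ minimizes
$E\{[y - m(\mathbf{x}, \boldsymbol{\theta})]^2 \mid \mathbf{x}\}$ over $\Theta$ for any $\mathbf{x}$.

b. Explain why the result in part a is stronger than stating that $\boldsymbol{\theta}_o$ solves
problem (12.3).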


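12.2. Consider the model $E(y \mid \mathbf{x}) = m(\mathbf{x}, \boldsymbol{\theta}_o)$,
$\mathrm{Var}(y \mid \mathbf{x}) = \exp(\alpha_o + \mathbf{x}\boldsymbol{\gamma}_o)$, where $\mathbf{x}$ is $1 \times K$. The vector
$\boldsymbol{\theta}_o$ is $P \times 1$ and $\boldsymbol{\gamma}_o$ is $K \times 1$.

a. Define $u \equiv y - E(y \mid \mathbf{x})$. Show that $E(u^2 \mid \mathbf{x}) = \exp(\alpha_o + \mathbf{x}\boldsymbol{\gamma}_o)$.

b. Let $\hat{u}_i$ denote the residuals from estimating the conditional mean by NLS.
Argue that $\alpha_o$ and $\boldsymbol{\gamma}_o$ can be consistently estimated by a nonlinear
regression where $\hat{u}_i^2$ is the dependent variable and the regression function is
$\exp(\alpha_o + \mathbf{x}\boldsymbol{\gamma}_o)$. (Hint: Use the results on two-step estimation.)

c. Using part b, propose a (feasible) weighted least squares procedure for
estimating $\boldsymbol{\theta}_o$.

d. If the error $u$ is divided by $[\mathrm{Var}(u \mid \mathbf{x})]^{1/2}$, we obtain
$v \equiv \exp[-(\alpha_o + \mathbf{x}\boldsymbol{\gamma}_o)/2]u$. Argue that if $v$ is independent of $\mathbf{x}$, then
$\boldsymbol{\gamma}_o$ is consistently estimated from the regression $\log(\hat{u}_i^2)$ on $1$,
$\mathbf{x}_i$, $i = 1, 2, \ldots, N$. (The intercept from this regression will not consistently
estimate $\alpha_o$, but this fact does not matter, since
$\exp(\alpha_o + \mathbf{x}\boldsymbol{\gamma}_o) = \sigma_o^2 \exp(\mathbf{x}\boldsymbol{\gamma}_o)$, and $\sigma_o^2$ can be estimated from
the WNLS regression.)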

e. What would you do after running WNLS if you suspect the variance function is 
misspecified? 


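12.3. Consider the exponential regression function
$m(\mathbf{x}, \boldsymbol{\theta}) = \exp(\mathbf{x}\boldsymbol{\theta})$, where $\mathbf{x}$ is $1 \times K$.

a. Suppose you have estimated a special case of the model,
$\hat{E}(y \mid \mathbf{z}) = \exp[\hat{\theta}_1 + \hat{\theta}_2 \log(z_1) + \hat{\theta}_3 z_2]$, where $z_1$ and $z_2$ are the
conditioning variables. Show that $\hat{\theta}_2$ is approximately the elasticity of
$\hat{E}(y \mid \mathbf{z})$ with respect to $z_1$.

b. In the same estimated model from part a, how would you approximate the
percentage change in $\hat{E}(y \mid \mathbf{z})$ given $\Delta z_2 = 1$?

c. Now suppose a square of $z_2$ is added:
$\hat{E}(y \mid \mathbf{z}) = \exp[\hat{\theta}_1 + \hat{\theta}_2 \log(z_1) + \hat{\theta}_3 z_2 + \hat{\theta}_4 z_2^2]$, where
$\hat{\theta}_3 > 0$ and $\hat{\theta}_4 < 0$. How would you compute the value of $z_2$ where the
partial effect of $z_2$ on $\hat{E}(y \mid \mathbf{z})$ becomes negative?

d. Now write the general model as
$\exp(\mathbf{x}\boldsymbol{\theta}) = \exp(\mathbf{x}_1\boldsymbol{\theta}_1 + \mathbf{x}_2\boldsymbol{\theta}_2)$, where $\mathbf{x}_1$ is $1 \times K_1$ (and
probably contains unity as an element) and $\mathbf{x}_2$ is $1 \times K_2$. Derive the usual
(nonrobust) and heteroskedasticity-robust LM tests of
$H_0: \boldsymbol{\theta}_{o2} = \mathbf{0}$, where $\boldsymbol{\theta}_o$ indexes $E(y \mid \mathbf{x})$.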

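12.4. a. Show that the score for WNLS is
$\mathbf{s}_i(\boldsymbol{\theta}; \boldsymbol{\gamma}) = -\nabla_\theta m(\mathbf{x}_i, \boldsymbol{\theta})' u_i(\boldsymbol{\theta})/h(\mathbf{x}_i, \boldsymbol{\gamma})$.

b. Show that, under Assumption WNLS.1,
$E[\mathbf{s}_i(\boldsymbol{\theta}_o; \boldsymbol{\gamma}) \mid \mathbf{x}_i] = \mathbf{0}$ for any value of $\boldsymbol{\gamma}$.

c. Show that, under Assumption WNLS.1,
$E[\nabla_\gamma \mathbf{s}_i(\boldsymbol{\theta}_o; \boldsymbol{\gamma})] = \mathbf{0}$ for any value of $\boldsymbol{\gamma}$.

d. How would you estimate $\mathrm{Avar}(\hat{\boldsymbol{\theta}})$ without Assumption WNLS.3?

e. Verify that equation (12.59) is valid under Assumption WNLS.3.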


12.5. a. For the regression model

$$m(\mathbf{x}, \boldsymbol{\theta}) = G[\mathbf{x}\boldsymbol{\beta} + \delta_1(\mathbf{x}\boldsymbol{\beta})^2 + \delta_2(\mathbf{x}\boldsymbol{\beta})^3],$$

where $G(\cdot)$ is a known, twice continuously differentiable function with derivative
$g(\cdot)$, derive the standard LM test of $H_0: \delta_{o1} = 0, \delta_{o2} = 0$ using NLS. Show that,
when $G(\cdot)$ is the identity function, the test reduces to RESET from Section 6.2.3.

b. Explain how to implement the variable addition version of the test.
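
12.6. Consider a panel data model for a random draw $i$ from the population:

$$y_{it} = m(\mathbf{x}_{it}, \boldsymbol{\theta}_o) + u_{it}, \quad E(u_{it} \mid \mathbf{x}_{it}) = 0, \quad t = 1, \ldots, T.$$

a. If you apply pooled nonlinear least squares to estimate $\boldsymbol{\theta}_o$, how would you
estimate its asymptotic variance without further assumptions?

b. Suppose that the model is dynamically complete in the conditional mean, so that
$E(u_{it} \mid \mathbf{x}_{it}, u_{i,t-1}, \mathbf{x}_{i,t-1}, \ldots) = 0$ for all $t$. In addition,
$E(u_{it}^2 \mid \mathbf{x}_{it}) = \sigma_o^2$. Show that the usual statistics from a pooled NLS
regression are valid. (Hint: The objective function for each $i$ is
$q_i(\boldsymbol{\theta}) = \sum_{t=1}^{T} [y_{it} - m(\mathbf{x}_{it}, \boldsymbol{\theta})]^2/2$ and the score is
$\mathbf{s}_i(\boldsymbol{\theta}) = -\sum_{t=1}^{T} \nabla_\theta m(\mathbf{x}_{it}, \boldsymbol{\theta})' u_{it}(\boldsymbol{\theta})$. Now show that
$\mathbf{B}_o = \sigma_o^2 \mathbf{A}_o$ and that $\sigma_o^2$ is consistently estimated by
$(NT - P)^{-1} \sum_{i=1}^{N} \sum_{t=1}^{T} \hat{u}_{it}^2$.)

c. For the mean model $m(\mathbf{x}_{it}, \boldsymbol{\beta}, \boldsymbol{\delta})$, consider testing
$H_0: \boldsymbol{\delta}_o = \bar{\boldsymbol{\delta}}$. Show that under dynamic completeness and homoskedasticity
(under $H_0$), a valid version of the LM statistic is obtained as $NT R_u^2$, where
$R_u^2$ is the uncentered R-squared from the pooled OLS regression

$$\tilde{u}_{it} \text{ on } \nabla_\beta \tilde{m}_{it},\ \nabla_\delta \tilde{m}_{it}, \quad t = 1, \ldots, T;\ i = 1, \ldots, N,$$

where $\tilde{u}_{it} = y_{it} - m(\mathbf{x}_{it}, \tilde{\boldsymbol{\beta}}, \bar{\boldsymbol{\delta}})$ and the gradients are evaluated at
$(\tilde{\boldsymbol{\beta}}, \bar{\boldsymbol{\delta}})$.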

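12.7. Consider a nonlinear analogue of the SUR system from Chapter 7:

$$E(y_{ig} \mid \mathbf{x}_i) = E(y_{ig} \mid \mathbf{x}_{ig}) = m_g(\mathbf{x}_{ig}, \boldsymbol{\theta}_{og}), \quad g = 1, \ldots, G.$$

Thus, each $\boldsymbol{\theta}_{og}$ can be estimated by NLS using only equation $g$; call these
$\hat{\hat{\boldsymbol{\theta}}}_g$. Suppose also that $\mathrm{Var}(\mathbf{y}_i \mid \mathbf{x}_i) = \boldsymbol{\Omega}_o$, where
$\boldsymbol{\Omega}_o$ is $G \times G$ and positive definite.

a. Explain how to consistently estimate $\boldsymbol{\Omega}_o$ (as usual, with $G$ fixed and
$N \to \infty$). Call this estimator $\hat{\boldsymbol{\Omega}}$.

b. Let $\hat{\boldsymbol{\theta}}$ be the nonlinear SUR estimator that solves

$$\min_{\boldsymbol{\theta}} \sum_{i=1}^{N} [\mathbf{y}_i - \mathbf{m}(\mathbf{x}_i, \boldsymbol{\theta})]' \hat{\boldsymbol{\Omega}}^{-1} [\mathbf{y}_i - \mathbf{m}(\mathbf{x}_i, \boldsymbol{\theta})]/2,$$

where $\mathbf{m}(\mathbf{x}_i, \boldsymbol{\theta})$ is the $G \times 1$ vector of conditional mean functions and
$\mathbf{y}_i$ is $G \times 1$. Show that

$$\mathrm{Avar}\, \sqrt{N}(\hat{\boldsymbol{\theta}} - \boldsymbol{\theta}_o) = \{E[\nabla_\theta \mathbf{m}(\mathbf{x}_i, \boldsymbol{\theta}_o)' \boldsymbol{\Omega}_o^{-1} \nabla_\theta \mathbf{m}(\mathbf{x}_i, \boldsymbol{\theta}_o)]\}^{-1}.$$

(Hint: Under standard regularity conditions,
$N^{-1/2} \sum_{i=1}^{N} \nabla_\theta \mathbf{m}(\mathbf{x}_i, \boldsymbol{\theta}_o)' \hat{\boldsymbol{\Omega}}^{-1} [\mathbf{y}_i - \mathbf{m}(\mathbf{x}_i, \boldsymbol{\theta}_o)] = N^{-1/2} \sum_{i=1}^{N} \nabla_\theta \mathbf{m}(\mathbf{x}_i, \boldsymbol{\theta}_o)' \boldsymbol{\Omega}_o^{-1} [\mathbf{y}_i - \mathbf{m}(\mathbf{x}_i, \boldsymbol{\theta}_o)] + o_p(1)$.)

c. How would you estimate $\mathrm{Avar}(\hat{\boldsymbol{\theta}})$?

d. If $\boldsymbol{\Omega}_o$ is diagonal and if the assumptions stated previously hold, show that
NLS equation by equation is just as asymptotically efficient as the nonlinear SUR
estimator.

e. Is there a nonlinear analogue of Theorem 7.7 for linear systems in the sense that
nonlinear SUR and NLS equation by equation are asymptotically equivalent when the
same explanatory variables appear in each equation? (Hint: When would
$\nabla_\theta \mathbf{m}(\mathbf{x}_i, \boldsymbol{\theta}_o)$ have the form needed to apply the hint in Problem 7.5? You
might try $E(y_g \mid \mathbf{x}) = \exp(\mathbf{x}\boldsymbol{\theta}_{og})$ for all $g$ as an example.)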


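12.8. Consider the M-estimator with estimated nuisance parameter $\hat{\boldsymbol{\gamma}}$, where
$\sqrt{N}(\hat{\boldsymbol{\gamma}} - \boldsymbol{\gamma}_o) = O_p(1)$. If assumption (12.37) holds under the null
hypothesis, show that the QLR statistic still has a limiting chi-square distribution,
assuming also that $\mathbf{A}_o = \mathbf{B}_o$. (Hint: Start from equation (12.76) but where
$\sqrt{N}(\tilde{\boldsymbol{\theta}} - \hat{\boldsymbol{\theta}}) = \mathbf{A}_o^{-1} N^{-1/2} \sum_{i=1}^{N} \mathbf{s}_i(\tilde{\boldsymbol{\theta}}; \hat{\boldsymbol{\gamma}}) + o_p(1)$.
Now use a mean value expansion of the score about $(\tilde{\boldsymbol{\theta}}, \boldsymbol{\gamma}_o)$ to show that
$\sqrt{N}(\tilde{\boldsymbol{\theta}} - \hat{\boldsymbol{\theta}}) = \mathbf{A}_o^{-1} N^{-1/2} \sum_{i=1}^{N} \mathbf{s}_i(\tilde{\boldsymbol{\theta}}; \boldsymbol{\gamma}_o) + o_p(1)$.)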


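12.9. For scalar $y$, suppose that $y = m(\mathbf{x}, \boldsymbol{\beta}_o) + u$, where $\mathbf{x}$ is a
$1 \times K$ vector.

a. If $E(u \mid \mathbf{x}) = 0$, what can you say about $\mathrm{Med}(y \mid \mathbf{x})$?

b. Suppose that $u$ and $\mathbf{x}$ are independent. Show that
$E(y \mid \mathbf{x}) - \mathrm{Med}(y \mid \mathbf{x})$ does not depend on $\mathbf{x}$.

c. What does part b imply about $\partial E(y \mid \mathbf{x})/\partial x_j$ and
$\partial\, \mathrm{Med}(y \mid \mathbf{x})/\partial x_j$?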


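12.10. For each $i$, let $y_i$ be a nonnegative integer with a conditional binomial
distribution with upper bound $n_i$ (a positive integer) and probability of success
$p(\mathbf{x}_i, \boldsymbol{\beta}_o)$, where $0 < p(\mathbf{x}, \boldsymbol{\beta}) < 1$ for all $\mathbf{x}$ and $\boldsymbol{\beta}$. (A leading
case is the logistic function.) Therefore,
$E(y_i \mid \mathbf{x}_i, n_i) = n_i p(\mathbf{x}_i, \boldsymbol{\beta}_o)$ and
$\mathrm{Var}(y_i \mid \mathbf{x}_i, n_i) = n_i p(\mathbf{x}_i, \boldsymbol{\beta}_o)[1 - p(\mathbf{x}_i, \boldsymbol{\beta}_o)]$. Explain in detail how
to obtain the weighted nonlinear least squares estimator of $\boldsymbol{\beta}_o$.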

12.11. a. Derive equation (12.98) for the WMNLS estimator. You can use equation
(12.97) to show that estimation of $\boldsymbol{\gamma}^*$ can be ignored in obtaining the limiting
distribution of $\sqrt{N}(\hat{\boldsymbol{\theta}} - \boldsymbol{\theta}_o)$.

b. If assumption (12.96) holds (and $\hat{\boldsymbol{\gamma}}$ is $\sqrt{N}$-consistent for
$\boldsymbol{\gamma}_o$), what is the asymptotic variance of $\sqrt{N}(\hat{\boldsymbol{\theta}} - \boldsymbol{\theta}_o)$?

c. Explain how you would compute the QLR statistic for testing hypotheses about
$\boldsymbol{\theta}_o$ under assumption (12.96). Naturally, you should assume the mean is correctly
specified.


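12.12. Let $y_i$ be a scalar response, and let $\mathbf{x}_i$ and $\mathbf{w}_i$ be vectors.
Suppose that

$$E(y_i \mid \mathbf{x}_i, \mathbf{w}_i) = m(\mathbf{x}_i, v(\mathbf{w}_i, \boldsymbol{\delta}_o), \boldsymbol{\theta}_o),$$

where $v(\mathbf{w}, \boldsymbol{\delta})$ is a known function of $\mathbf{w}$ and the $J$ parameters
$\boldsymbol{\delta}$, and $\boldsymbol{\theta}_o$ is a $P \times 1$ vector. Assume that we have a
$\sqrt{N}$-asymptotically normal estimator of $\boldsymbol{\delta}_o$ that satisfies

$$\sqrt{N}(\hat{\boldsymbol{\delta}} - \boldsymbol{\delta}_o) = N^{-1/2} \sum_{i=1}^{N} \mathbf{r}(\mathbf{w}_i, \boldsymbol{\delta}_o) + o_p(1).$$

Let $\hat{\boldsymbol{\theta}}$ be the two-step NLS estimator that solves

$$\min_{\boldsymbol{\theta}} \sum_{i=1}^{N} [y_i - m(\mathbf{x}_i, v(\mathbf{w}_i, \hat{\boldsymbol{\delta}}), \boldsymbol{\theta})]^2/2.$$

a. Show that, under standard regularity conditions, the asymptotic variance of
$\sqrt{N}(\hat{\boldsymbol{\theta}} - \boldsymbol{\theta}_o)$ is at least as large as the asymptotic variance if we
knew $\boldsymbol{\delta}_o$.

b. Propose a consistent estimator of $\mathrm{Avar}\, \sqrt{N}(\hat{\boldsymbol{\theta}} - \boldsymbol{\theta}_o)$.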

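12.13. Let $\{(\mathbf{x}_{it}, y_{it}) : t = 1, \ldots, T\}$ be a panel data set, and assume that for
some $\boldsymbol{\theta}_o \in \Theta \subset \mathbb{R}^P$, $E(y_{it} \mid \mathbf{x}_{it}) = m(\mathbf{x}_{it}, \boldsymbol{\theta}_o)$,
$t = 1, \ldots, T$. Further, let $h(\mathbf{x}_{it}, \boldsymbol{\gamma})$ be a model for
$\mathrm{Var}(y_{it} \mid \mathbf{x}_{it})$ for each $t$. Generally, let $\hat{\boldsymbol{\gamma}}$ be a
$\sqrt{N}$-consistent estimator for some $\boldsymbol{\gamma}^*$, where $h(\mathbf{x}_{it}, \boldsymbol{\gamma}^*)$ need not
equal $\mathrm{Var}(y_{it} \mid \mathbf{x}_{it})$.

a. Let $\hat{\boldsymbol{\theta}}$ denote the pooled weighted nonlinear least squares estimator. Is
strict exogeneity of $\{\mathbf{x}_{it}\}$ needed for consistency of $\hat{\boldsymbol{\theta}}$ for
$\boldsymbol{\theta}_o$? Explain.

b. Propose a consistent estimator of $\mathrm{Avar}\, \sqrt{N}(\hat{\boldsymbol{\theta}} - \boldsymbol{\theta}_o)$ without
further assumptions.

c. If the mean is dynamically complete, how would you estimate
$\mathrm{Avar}\, \sqrt{N}(\hat{\boldsymbol{\theta}} - \boldsymbol{\theta}_o)$?

d. If the mean is dynamically complete and
$\mathrm{Var}(y_{it} \mid \mathbf{x}_{it}) = \sigma_o^2 h(\mathbf{x}_{it}, \boldsymbol{\gamma}_o)$ and $\hat{\boldsymbol{\gamma}}$ is
$\sqrt{N}$-consistent for $\boldsymbol{\gamma}_o$, how would you estimate
$\mathrm{Avar}\, \sqrt{N}(\hat{\boldsymbol{\theta}} - \boldsymbol{\theta}_o)$?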


12.14. Derive equation (12.115). (Hint: The product of the two indicator functions
appearing in $\mathbf{s}_i(\boldsymbol{\theta}_o)$ is identically zero.)


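12.15. Let $\{(\mathbf{x}_{it}, y_{it}) : t = 1, \ldots, T\}$ be a panel data set and assume that, for
a given quantile $0 < \tau < 1$, $\mathrm{Quant}_\tau(y_{it} \mid \mathbf{x}_{it}) = \mathbf{x}_{it}\boldsymbol{\theta}_o$,
$t = 1, \ldots, T$. Let $\hat{\boldsymbol{\theta}}$ be the pooled quantile regression estimator discussed
in Section 12.10.3.

a. Write down the approximate first-order condition solved by $\hat{\boldsymbol{\theta}}$. In
particular, define a suitable “score” for each $t$ and then for a random draw $i$.

b. Show that if the quantile is dynamically complete in the sense that

$$\mathrm{Quant}_\tau(y_{it} \mid \mathbf{x}_{it}, y_{i,t-1}, \mathbf{x}_{i,t-1}, \ldots, y_{i1}, \mathbf{x}_{i1}) = \mathrm{Quant}_\tau(y_{it} \mid \mathbf{x}_{it}) = \mathbf{x}_{it}\boldsymbol{\theta}_o, \quad t = 1, \ldots, T,$$

then $\mathbf{B}_o = E[\mathbf{s}_i(\boldsymbol{\theta}_o)\mathbf{s}_i(\boldsymbol{\theta}_o)'] = \tau(1 - \tau) \sum_{t=1}^{T} E(\mathbf{x}_{it}'\mathbf{x}_{it})$.
How would you estimate $\mathbf{B}_o$ in this case?

c. Show that, whether or not the quantile is dynamically complete,

$$\mathbf{A}_o = \sum_{t=1}^{T} E[f_{u_t}(0 \mid \mathbf{x}_{it})\, \mathbf{x}_{it}'\mathbf{x}_{it}],$$

where $f_{u_t}(\cdot \mid \mathbf{x}_t)$ is the density of $u_{it}$ given $\mathbf{x}_{it} = \mathbf{x}_t$.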


12.16. Consider a linear model with an endogenous explanatory variable, $y_2$,
along with a reduced form for $y_2$:

$$y_1 = \mathbf{z}_1\boldsymbol{\delta}_1 + \alpha_1 y_2 + u_1$$
$$y_2 = \mathbf{z}\boldsymbol{\pi}_2 + v_2,$$

where $\mathbf{z} = (\mathbf{z}_1, \mathbf{z}_2)$ (with first element of $\mathbf{z}_1$ unity) and
$\boldsymbol{\pi}_2 = (\boldsymbol{\pi}_{21}', \boldsymbol{\pi}_{22}')'$; for notational simplicity, we do not use “o”
subscripts on the true parameters.

a. Under the assumption $\mathrm{Med}(v_2 \mid \mathbf{z}) = 0$, how would you estimate
$\boldsymbol{\pi}_2$?

b. Suppose that $\mathrm{Med}(u_1 \mid y_2, \mathbf{z}) = \mathrm{Med}(u_1 \mid v_2) = \rho_1 v_2$. Propose a
two-step consistent estimator of $\boldsymbol{\delta}_1$ and $\alpha_1$.

c. How would you state the null hypothesis that $y_2$ is exogenous, and how would
you test it?

d. Generally, how would you obtain legitimate standard errors for
$\hat{\boldsymbol{\delta}}_1$ and $\hat{\alpha}_1$?

e. If the bivariate conditional distribution $D(u_1, v_2 \mid \mathbf{z})$ satisfies the
centrally symmetric condition $D(u_1, v_2 \mid \mathbf{z}) = D(-u_1, -v_2 \mid \mathbf{z})$ (and all
variables have at least finite second moments), explain why we should expect the
procedure from part b and the usual 2SLS estimator to provide similar estimates in
large samples.


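12.17. Let $\hat{\boldsymbol{\theta}}$ be an M-estimator of $\boldsymbol{\theta}_o$ with score
$\mathbf{s}_i(\boldsymbol{\theta}) \equiv \mathbf{s}(\mathbf{w}_i, \boldsymbol{\theta})$ and expected Hessian $\mathbf{A}_o$ (evaluated at
$\boldsymbol{\theta}_o$). Let $\mathbf{g}(\mathbf{w}, \boldsymbol{\theta})$ be an $M \times 1$ function of the random vector
$\mathbf{w}$ and the parameter vector, and suppose we wish to estimate
$\boldsymbol{\delta}_o \equiv E[\mathbf{g}(\mathbf{w}_i, \boldsymbol{\theta}_o)]$. The natural estimator is
$\hat{\boldsymbol{\delta}} = N^{-1} \sum_{i=1}^{N} \mathbf{g}(\mathbf{w}_i, \hat{\boldsymbol{\theta}})$. Assume that
$\sqrt{N}(\hat{\boldsymbol{\theta}} - \boldsymbol{\theta}_o) \overset{a}{\sim} \mathrm{Normal}(\mathbf{0}, \mathbf{A}_o^{-1}\mathbf{B}_o\mathbf{A}_o^{-1})$, where
$\mathbf{B}_o = E[\mathbf{s}_i(\boldsymbol{\theta}_o)\mathbf{s}_i(\boldsymbol{\theta}_o)']$, as usual.

a. Assuming that $\mathbf{g}(\mathbf{w}, \cdot)$ is continuously differentiable on
$\mathrm{int}(\Theta)$, $\boldsymbol{\theta}_o \in \mathrm{int}(\Theta)$, and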
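other regularity conditions, find $\mathrm{Avar}\, \sqrt{N}(\hat{\boldsymbol{\delta}} - \boldsymbol{\delta}_o)$. (Hint: The
asymptotic variance depends on $\boldsymbol{\theta}_o$ and
$\mathbf{G}_o \equiv E[\nabla_\theta \mathbf{g}(\mathbf{w}_i, \boldsymbol{\theta}_o)]$.)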

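b. How would you consistently estimate $\mathrm{Avar}\, \sqrt{N}(\hat{\boldsymbol{\delta}} - \boldsymbol{\delta}_o)$?

c. Show that if $\mathbf{g}(\mathbf{w}, \boldsymbol{\theta}) = \mathbf{g}(\mathbf{x}, \boldsymbol{\theta})$, where $\mathbf{x}$ is exogenous
in the estimation problem used to obtain $\hat{\boldsymbol{\theta}}$ (so that
$E[\mathbf{s}(\mathbf{w}_i, \boldsymbol{\theta}_o) \mid \mathbf{x}_i] = \mathbf{0}$), then
$\mathrm{Avar}\, \sqrt{N}(\hat{\boldsymbol{\delta}} - \boldsymbol{\delta}_o) = \mathrm{Var}[\mathbf{g}(\mathbf{x}_i, \boldsymbol{\theta}_o)] + \mathbf{G}_o[\mathrm{Avar}\, \sqrt{N}(\hat{\boldsymbol{\theta}} - \boldsymbol{\theta}_o)]\mathbf{G}_o'$.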


13 Maximum Likelihood Methods


13.1 Introduction 


This chapter contains a general treatment of maximum likelihood estimation (MLE) 
under random sampling. All the models we considered in Part I could be estimated 
without making full distributional assumptions about the endogenous variables 
conditional on the exogenous variables: maximum likelihood methods were not 
needed. Instead, we focused primarily on zero-covariance and zero-conditional-mean 
assumptions, and secondarily on assumptions about conditional variances and co- 
variances. These assumptions were sufficient for obtaining consistent, asymptotically 
normal estimators, some of which were shown to be efficient within certain classes of 
estimators. 

Some texts on advanced econometrics take MLE as the unifying theme, and then 
most models are estimated by maximum likelihood. In addition to providing a 
unified approach to estimation, MLE has some desirable efficiency properties: it is 
generally the most efficient estimation procedure in the class of estimators that use 
information on the distribution of the endogenous variables given the exogenous 
variables. (We formalize the efficiency of MLE in Section 14.4.) So why not always 
use MLE? 

As we saw in Part I, efficiency usually comes at the price of nonrobustness, and this 
is certainly the case for maximum likelihood. Maximum likelihood estimators are 
generally inconsistent if some part of the specified distribution is misspecified. As an 
example, consider from Section 9.5 a simultaneous equations model that is linear in 
its parameters but nonlinear in some endogenous variables. There, we discussed esti- 
mation by instrumental variables methods. We could estimate SEMs nonlinear in 
endogenous variables by maximum likelihood if we assumed independence between 
the structural errors and the exogenous variables and if we assumed a particular dis- 
tribution for the structural errors, say, multivariate normal. The MLE would be 
asymptotically more efficient than the best GMM estimator, but failure of normality 
generally results in inconsistent estimators of all parameters. 

As a second example, suppose we wish to estimate $E(y \mid \mathbf{x})$, where $y$ is bounded
between zero and one. The logistic function,
$\exp(\mathbf{x}\boldsymbol{\beta})/[1 + \exp(\mathbf{x}\boldsymbol{\beta})]$, is a reasonable model for $E(y \mid \mathbf{x})$, and, as we
discussed in Chapter 12, nonlinear least squares provides consistent,
$\sqrt{N}$-asymptotically normal estimators under weak regularity conditions. We can
easily make inference robust to arbitrary heteroskedasticity in
$\mathrm{Var}(y \mid \mathbf{x})$. An alternative approach is to model the density of $y$ given
$\mathbf{x}$ (which, of course, implies a particular model for $E(y \mid \mathbf{x})$) and use MLE. As
we will see, the strength of MLE is that, under correct specification of the density,
we would have the asymptotically efficient estimators, and we would be able to
estimate any feature of the conditional distribution, such as $P(y = 1 \mid \mathbf{x})$. The
drawback is that, except in special cases, if we have misspecified the density in any
way, we will not be able to consistently estimate the conditional mean.

In most applications, specifying the distribution of the endogenous variables con- 
ditional on exogenous variables must have a component of arbitrariness, as economic 
theory rarely provides guidance. Our perspective is that, for robustness reasons, it 
is desirable to make as few assumptions as possible—at least until relaxing them 
becomes practically difficult. There are cases in which MLE turns out to be robust to 
failure of certain assumptions, but these must be examined on a case-by-case basis, a 
process that detracts from the unifying theme provided by the MLE approach. (One 
such example is nonlinear regression under a homoskedastic normal assumption; the 
MLE of the parameters $\boldsymbol{\beta}_o$ is identical to the NLS estimator, and we know the latter
is consistent and asymptotically normal quite generally. We will cover some other 
leading cases in Section 13.11 and Chapter 18.) 

Maximum likelihood plays an important role in modern econometric analysis, for 
good reason. There are many problems for which it is indispensable. For example, 
in Chapters 15, 16, and 17 we study various limited dependent variable models, and 
MLE plays a central role. 


13.2 Preliminaries and Examples 


Traditional maximum likelihood theory for independent, identically distributed
observations $\{\mathbf{y}_i \in \mathbb{R}^G : i = 1, 2, \ldots\}$ starts by specifying a family of
densities for $\mathbf{y}_i$. This is the framework used in introductory statistics courses,
where $y_i$ is a scalar with a normal or Poisson distribution. But in almost all
economic applications, we are interested in estimating parameters of conditional
distributions. Therefore, we assume that each random draw is partitioned as
$(\mathbf{x}_i, \mathbf{y}_i)$, where $\mathbf{x}_i \in \mathbb{R}^K$ and $\mathbf{y}_i \in \mathbb{R}^G$, and we are interested in
estimating a model for the conditional distribution of $\mathbf{y}_i$ given $\mathbf{x}_i$. We are
not interested in the distribution of $\mathbf{x}_i$, so we will not specify a model for it.
Consequently, the method of this chapter is properly called conditional maximum
likelihood estimation (CMLE). By taking $\mathbf{x}_i$ to be null we cover unconditional
MLE as a special case.

An alternative to viewing $(\mathbf{x}_i, \mathbf{y}_i)$ as a random draw from the population is to
treat the conditioning variables $\mathbf{x}_i$ as nonrandom vectors that are set ahead of
time and that appear in the unconditional distribution of $\mathbf{y}_i$. (This setup is
analogous to the fixed regressor assumption in classical regression analysis.) Then
the $\mathbf{y}_i$ cannot be identically distributed, and this fact complicates the
asymptotic analysis. More important, treating the $\mathbf{x}_i$ as nonrandom is much too
restrictive for all uses of maximum likelihood. In fact, later on we will cover
methods where $\mathbf{x}_i$ contains what are endogenous variables in a structural model,
but where it is convenient to obtain the distribution of one set of endogenous
variables conditional on another set. Once we know how to analyze the general CMLE
case, applications follow fairly directly.

It is important to understand that the subsequent results apply any time we have
random sampling in the cross section dimension. Thus, the general theory applies to
system estimation, as in Chapters 7 and 9, provided we are willing to assume a
distribution for $\mathbf{y}_i$ given $\mathbf{x}_i$. In addition, panel data settings with large
cross sections and relatively small time periods are encompassed, since the
appropriate asymptotic analysis is with the time dimension fixed and the cross
section dimension tending to infinity.

In order to perform maximum likelihood analysis we need to specify, or derive from
an underlying (structural) model, the density of $\mathbf{y}_i$ given $\mathbf{x}_i$. We assume
this density is known up to a finite number of unknown parameters, with the result
that we have a parametric model of a conditional density. The vector $\mathbf{y}_i$ can be
continuous or discrete, or it can have both discrete and continuous characteristics.
In many of our applications, $\mathbf{y}_i$ is a scalar, but this feature does not simplify
the general treatment.

We will carry along two examples to illustrate the general theory of conditional 
maximum likelihood. The first example is a binary response model, specifically the 
probit model. We postpone the uses and interpretation of binary response models
until Chapter 15. 


Example 13.1 (Probit): Suppose that the latent variable $y_i^*$ follows

$$y_i^* = \mathbf{x}_i\boldsymbol{\theta} + e_i \tag{13.1}$$

where $e_i$ is independent of $\mathbf{x}_i$ (which is a $1 \times K$ vector with first element
equal to unity for all $i$), $\boldsymbol{\theta}$ is a $K \times 1$ vector of parameters, and
$e_i \sim \mathrm{Normal}(0, 1)$. Instead of observing $y_i^*$, we observe only a binary
variable indicating the sign of $y_i^*$:

$$y_i = \begin{cases} 1 & \text{if } y_i^* > 0 \qquad (13.2) \\ 0 & \text{if } y_i^* \le 0 \qquad (13.3) \end{cases}$$

To be succinct, it is useful to write equations (13.2) and (13.3) in terms of the
indicator function, denoted $1[\cdot]$. Recall that this function is unity whenever the
statement in brackets is true, and zero otherwise. Thus, equations (13.2) and (13.3)
are equivalently written as $y_i = 1[y_i^* > 0]$. Because $e_i$ is normally
distributed, it is irrelevant whether the strict inequality is in equation (13.2) or
(13.3).

We can easily obtain the distribution of $y_i$ given $\mathbf{x}_i$:

$$P(y_i = 1 \mid \mathbf{x}_i) = P(y_i^* > 0 \mid \mathbf{x}_i) = P(\mathbf{x}_i\boldsymbol{\theta} + e_i > 0 \mid \mathbf{x}_i) = P(e_i > -\mathbf{x}_i\boldsymbol{\theta} \mid \mathbf{x}_i) = 1 - \Phi(-\mathbf{x}_i\boldsymbol{\theta}) = \Phi(\mathbf{x}_i\boldsymbol{\theta}), \tag{13.4}$$

where $\Phi(\cdot)$ denotes the standard normal cumulative distribution function (cdf). We
have used Property CD.4 in the chapter appendix along with the symmetry of the
normal distribution. Therefore,

$$P(y_i = 0 \mid \mathbf{x}_i) = 1 - \Phi(\mathbf{x}_i\boldsymbol{\theta}). \tag{13.5}$$

We can combine equations (13.4) and (13.5) into the density of $y_i$ given $\mathbf{x}_i$:

$$f(y \mid \mathbf{x}_i) = [\Phi(\mathbf{x}_i\boldsymbol{\theta})]^{y}[1 - \Phi(\mathbf{x}_i\boldsymbol{\theta})]^{1-y}, \quad y = 0, 1. \tag{13.6}$$

That $f(y \mid \mathbf{x}_i)$ is zero when $y \notin \{0, 1\}$ is obvious, so we will not be explicit
about this in the future.
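
To make the probit density concrete, here is a minimal Python sketch that evaluates the response probability in (13.4) and the density in (13.6) using only the standard library. The function names `std_normal_cdf` and `probit_density` and the example numbers are assumptions for illustration and are not part of the text.

```python
import math


def std_normal_cdf(z):
    """Standard normal cdf, Phi(z), written in terms of the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def probit_density(y, x, theta):
    """Density f(y | x) from (13.6); x and theta are equal-length sequences.

    Returns zero for any y outside {0, 1}.
    """
    if y not in (0, 1):
        return 0.0
    index = sum(xj * tj for xj, tj in zip(x, theta))  # x * theta
    p_one = std_normal_cdf(index)                     # P(y = 1 | x), eq. (13.4)
    return p_one if y == 1 else 1.0 - p_one           # eq. (13.5) when y = 0


# Hypothetical values: intercept plus one regressor.
x_i = [1.0, 0.5]
theta = [0.2, -0.4]
p1 = probit_density(1, x_i, theta)
p0 = probit_density(0, x_i, theta)
```

In this example the index $\mathbf{x}_i\boldsymbol{\theta}$ equals zero, so both outcomes have probability one-half, and the two density values sum to one as any density over $\{0, 1\}$ must.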


Our second example is useful when the variable to be explained takes on nonnegative
integer values. Such a variable is called a count variable. We will discuss the use
and interpretation of count data models in Chapter 18. For now, it suffices to note
that a linear model for $E(y \mid \mathbf{x})$ when $y$ takes on nonnegative integer values is
not ideal because it can lead to negative predicted values. Further, since $y$ can
take on the value zero with positive probability, the transformation $\log(y)$ cannot
be used to obtain a model with constant elasticities or constant semielasticities. A
functional form well suited for $E(y \mid \mathbf{x})$ is $\exp(\mathbf{x}\boldsymbol{\theta})$. We could estimate
$\boldsymbol{\theta}$ by using NLS, but all of the standard distributions for count variables
imply heteroskedasticity (see Chapter 18). Thus, we can hope to do better. A
traditional approach to regression models with count data is to assume that $y_i$
given $\mathbf{x}_i$ has a Poisson distribution.


Example 13.2 (Poisson Regression): Let $y_i$ be a nonnegative count variable; that
is, $y_i$ can take on integer values $0, 1, 2, \ldots$. Denote the conditional mean of
$y_i$ given the vector $\mathbf{x}_i$ as $E(y_i \mid \mathbf{x}_i) = \mu(\mathbf{x}_i)$. A natural distribution
for $y_i$ given $\mathbf{x}_i$ is the Poisson distribution:

$$f(y \mid \mathbf{x}_i) = \exp[-\mu(\mathbf{x}_i)]\{\mu(\mathbf{x}_i)\}^{y}/y!, \quad y = 0, 1, 2, \ldots \tag{13.7}$$

(We use $y$ as the dummy argument in the density, not to be confused with the random
variable $y_i$.) Once we choose a form for the conditional mean function, we have
completely determined the distribution of $y_i$ given $\mathbf{x}_i$. For example, from
equation (13.7), $P(y_i = 0 \mid \mathbf{x}_i) = \exp[-\mu(\mathbf{x}_i)]$. An important feature of the
Poisson distribution is that the variance equals the mean:
$\mathrm{Var}(y_i \mid \mathbf{x}_i) = E(y_i \mid \mathbf{x}_i) = \mu(\mathbf{x}_i)$. The usual choice for $\mu(\cdot)$ is
$\mu(\mathbf{x}) = \exp(\mathbf{x}\boldsymbol{\theta})$, where $\boldsymbol{\theta}$ is $K \times 1$ and $\mathbf{x}$ is $1 \times K$ with first
element unity.
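
The properties just stated can be checked numerically. The sketch below, written in standard-library Python with hypothetical names and parameter values, evaluates the density (13.7) under the exponential mean and confirms that it sums to one, that $P(y_i = 0 \mid \mathbf{x}_i) = \exp[-\mu(\mathbf{x}_i)]$, and that the mean and variance both equal $\mu(\mathbf{x}_i)$. The infinite sums are truncated at a large count, which is an approximation.

```python
import math


def poisson_density(y, x, theta):
    """Poisson density (13.7) with mean mu(x) = exp(x * theta).

    Computed on the log scale; math.lgamma(y + 1) equals log(y!).
    """
    index = sum(xj * tj for xj, tj in zip(x, theta))
    mu = math.exp(index)
    return math.exp(-mu + y * index - math.lgamma(y + 1))


# Hypothetical values: intercept plus one regressor, so x * theta = 1.
x_i = [1.0, 2.0]
theta = [0.5, 0.25]
mu_i = math.exp(1.0)

support = range(0, 200)  # truncation point chosen so the omitted tail is negligible
total = sum(poisson_density(y, x_i, theta) for y in support)
mean = sum(y * poisson_density(y, x_i, theta) for y in support)
variance = sum((y - mean) ** 2 * poisson_density(y, x_i, theta) for y in support)
```

Here `total` is one and `mean` and `variance` both match `mu_i` up to floating-point error, which is the equidispersion property of the Poisson distribution.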


13.3 General Framework for Conditional Maximum Likelihood Estimation 


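Let $p_o(\mathbf{y} \mid \mathbf{x})$ denote the conditional density of $\mathbf{y}_i$ given
$\mathbf{x}_i = \mathbf{x}$, where $\mathbf{y}$ and $\mathbf{x}$ are dummy arguments. We index this density by
“o” to emphasize that it is the true density of $\mathbf{y}_i$ given $\mathbf{x}_i$, and not just
one of many candidates. It will be useful to let $\mathcal{X} \subset \mathbb{R}^K$ denote the possible
values for $\mathbf{x}_i$ and $\mathcal{Y}$ denote the possible values of $\mathbf{y}_i$; $\mathcal{X}$ and
$\mathcal{Y}$ are called the supports of the random vectors $\mathbf{x}_i$ and $\mathbf{y}_i$,
respectively.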

For a general treatment, we assume that, for all $\mathbf{x} \in \mathcal{X}$, $p_o(\cdot \mid \mathbf{x})$ is a
density with respect to a $\sigma$-finite measure, denoted $\nu(d\mathbf{y})$. Defining a
$\sigma$-finite measure would take us too far afield. We will say little more about the
measure $\nu(d\mathbf{y})$ because it does not play a crucial role in applications. It
suffices to know that $\nu(d\mathbf{y})$ can be chosen to allow $\mathbf{y}_i$ to be discrete,
continuous, or some mixture of the two. When $\mathbf{y}_i$ is discrete, the measure
$\nu(d\mathbf{y})$ simply turns all integrals into sums; when $\mathbf{y}_i$ is purely continuous,
we obtain the usual Riemann integrals. Even in more complicated cases (where, say,
$\mathbf{y}_i$ has both discrete and continuous characteristics) we can get by with tools
from basic probability without ever explicitly defining $\nu(d\mathbf{y})$. For more on
measures and general integrals, you are referred to Billingsley (1979) and Davidson
(1994, Chaps. 3 and 4).

In Chapter 12 we saw how NLS can be motivated by the fact that
$\mu_o(\mathbf{x}) = E(y \mid \mathbf{x})$ minimizes $E\{[y - m(\mathbf{x})]^2\}$ over all functions
$m(\mathbf{x})$ with $E\{[m(\mathbf{x})]^2\} < \infty$. Conditional maximum likelihood has a similar
motivation. The result from probability that is crucial for applying the analogy
principle is the conditional Kullback-Leibler information inequality. Although there
are more general statements of this inequality, the following suffices for our
purpose: for any nonnegative function $f(\cdot \mid \mathbf{x})$ such that

$$\int_{\mathcal{Y}} f(\mathbf{y} \mid \mathbf{x})\, \nu(d\mathbf{y}) = 1, \quad \text{all } \mathbf{x} \in \mathcal{X}, \tag{13.8}$$

Property CD.1 in the chapter appendix implies that

$$\mathcal{K}(f; \mathbf{x}) \equiv \int_{\mathcal{Y}} \log[p_o(\mathbf{y} \mid \mathbf{x})/f(\mathbf{y} \mid \mathbf{x})]\, p_o(\mathbf{y} \mid \mathbf{x})\, \nu(d\mathbf{y}) \ge 0, \quad \text{all } \mathbf{x} \in \mathcal{X}. \tag{13.9}$$

Because the integral is identically zero for $f = p_o$, expression (13.9) says that,
for each $\mathbf{x}$, $\mathcal{K}(f; \mathbf{x})$ is minimized at $f = p_o$.
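
For a discrete outcome the integral in (13.9) is a sum, so the inequality is easy to check numerically at a fixed value of $\mathbf{x}$. The following Python sketch does this for one hypothetical true density and one hypothetical candidate on a three-point support; the function name and the numbers are assumptions for illustration only.

```python
import math


def kl_information(p_true, f_candidate):
    """Discrete version of (13.9) at a fixed x: sum_y log[p(y)/f(y)] p(y).

    Both arguments list probabilities over the same support, in the same
    order. Points with p(y) = 0 contribute nothing; the candidate must be
    positive wherever the true density is positive.
    """
    total = 0.0
    for p, f in zip(p_true, f_candidate):
        if p > 0.0:
            total += p * math.log(p / f)
    return total


p_o = [0.2, 0.5, 0.3]  # hypothetical true conditional density at some x
f = [0.3, 0.3, 0.4]    # a candidate density that also sums to one

kl_wrong = kl_information(p_o, f)    # strictly positive
kl_right = kl_information(p_o, p_o)  # exactly zero
```

The candidate that differs from $p_o$ gives a strictly positive value, while setting $f = p_o$ gives zero, which is the sense in which $p_o$ minimizes $\mathcal{K}(f; \mathbf{x})$.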
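We can apply inequality (13.9) to a parametric model for $p_o(\cdot \mid \mathbf{x})$,

$$\{f(\cdot \mid \mathbf{x}; \boldsymbol{\theta}),\ \boldsymbol{\theta} \in \Theta,\ \Theta \subset \mathbb{R}^P\}, \tag{13.10}$$

which we assume satisfies condition (13.8) for each $\mathbf{x} \in \mathcal{X}$ and each
$\boldsymbol{\theta} \in \Theta$; if it does not, then $f(\cdot \mid \mathbf{x}; \boldsymbol{\theta})$ does not integrate to
unity (with respect to the measure $\nu$), and as a result it is a very poor candidate
for $p_o(\mathbf{y} \mid \mathbf{x})$. Model (13.10) is a correctly specified model of the conditional
density, $p_o(\cdot \mid \cdot)$, if, for some $\boldsymbol{\theta}_o \in \Theta$,

$$f(\cdot \mid \mathbf{x}; \boldsymbol{\theta}_o) = p_o(\cdot \mid \mathbf{x}), \quad \text{all } \mathbf{x} \in \mathcal{X}. \tag{13.11}$$

As we discussed in Chapter 12, it is useful to use $\boldsymbol{\theta}_o$ to distinguish the true
value of the parameter from a generic element of $\Theta$. In particular examples, we
will not bother making this distinction unless it is needed to make a point.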

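For each $\mathbf{x} \in \mathcal{X}$, $\mathcal{K}(f; \mathbf{x})$ can be written as
$E\{\log[p_o(\mathbf{y}_i \mid \mathbf{x}_i)] \mid \mathbf{x}_i = \mathbf{x}\} - E\{\log[f(\mathbf{y}_i \mid \mathbf{x}_i)] \mid \mathbf{x}_i = \mathbf{x}\}$.
Therefore, if the parametric model is correctly specified, then
$E\{\log[f(\mathbf{y}_i \mid \mathbf{x}_i; \boldsymbol{\theta}_o)] \mid \mathbf{x}_i\} \ge E\{\log[f(\mathbf{y}_i \mid \mathbf{x}_i; \boldsymbol{\theta})] \mid \mathbf{x}_i\}$, or

$$E[\ell_i(\boldsymbol{\theta}_o) \mid \mathbf{x}_i] \ge E[\ell_i(\boldsymbol{\theta}) \mid \mathbf{x}_i], \quad \boldsymbol{\theta} \in \Theta, \tag{13.12}$$

where

$$\ell_i(\boldsymbol{\theta}) \equiv \ell(\mathbf{y}_i, \mathbf{x}_i, \boldsymbol{\theta}) \equiv \log f(\mathbf{y}_i \mid \mathbf{x}_i; \boldsymbol{\theta}) \tag{13.13}$$

is the conditional log likelihood for observation $i$. Note that
$\ell_i(\boldsymbol{\theta})$ is a random function of $\boldsymbol{\theta}$, since it depends on the random
vector $(\mathbf{x}_i, \mathbf{y}_i)$. By taking the expected value of expression (13.12) and using
iterated expectations, we see that $\boldsymbol{\theta}_o$ solves

$$\max_{\boldsymbol{\theta} \in \Theta} E[\ell_i(\boldsymbol{\theta})], \tag{13.14}$$

where the expectation is with respect to the joint distribution of
$(\mathbf{x}_i, \mathbf{y}_i)$. The sample analogue of expression (13.14) is

$$\max_{\boldsymbol{\theta} \in \Theta} N^{-1} \sum_{i=1}^{N} \log f(\mathbf{y}_i \mid \mathbf{x}_i; \boldsymbol{\theta}). \tag{13.15}$$

A solution to problem (13.15), assuming that one exists, is the conditional maximum
likelihood estimator of $\boldsymbol{\theta}_o$, which we denote as $\hat{\boldsymbol{\theta}}$. We will
sometimes drop “conditional” when it is not needed for clarity.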

The CMLE is clearly an M-estimator, since a maximization problem is easily turned
into a minimization problem: in the notation of Chapter 12, take
$\mathbf{w}_i \equiv (\mathbf{x}_i, \mathbf{y}_i)$ and
$q(\mathbf{w}_i, \boldsymbol{\theta}) \equiv -\log f(\mathbf{y}_i \mid \mathbf{x}_i; \boldsymbol{\theta})$. As long as we keep track of the
minus sign in front of the log likelihood, we can apply the results in Chapter 12
directly.

The motivation for the conditional MLE as a solution to problem (13.15) may appear
backward if you learned about MLE in an introductory statistics course. In a
traditional framework, we would treat the $\mathbf{x}_i$ as constants appearing in the
distribution of $\mathbf{y}_i$, and we would define $\hat{\boldsymbol{\theta}}$ as the solution to

$$\max_{\boldsymbol{\theta} \in \Theta} \prod_{i=1}^{N} f(\mathbf{y}_i \mid \mathbf{x}_i; \boldsymbol{\theta}). \tag{13.16}$$

Under independence, the product in expression (13.16) is the model for the joint
density of $(\mathbf{y}_1, \ldots, \mathbf{y}_N)$, evaluated at the data. Because maximizing the
function in (13.16) is the same as maximizing its natural log, we are led to problem
(13.15). However, the arguments explaining why solving (13.16) should lead to a good
estimator of $\boldsymbol{\theta}_o$ are necessarily heuristic. By contrast, the analogy
principle applies directly to problem (13.15), and we need not assume that the
$\mathbf{x}_i$ are fixed.

In our two examples, the conditional log likelihoods are fairly simple. 


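Example 13.1 (continued): In the probit example, the log likelihood for observation
$i$ is $\ell_i(\boldsymbol{\theta}) = y_i \log \Phi(\mathbf{x}_i\boldsymbol{\theta}) + (1 - y_i) \log[1 - \Phi(\mathbf{x}_i\boldsymbol{\theta})]$.


Example 13.2 (continued): In the Poisson example,
$\ell_i(\boldsymbol{\theta}) = -\exp(\mathbf{x}_i\boldsymbol{\theta}) + y_i\mathbf{x}_i\boldsymbol{\theta} - \log(y_i!)$. Normally, we would
drop the last term in defining $\ell_i(\boldsymbol{\theta})$ because it does not affect the
maximization problem.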
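
As a small worked illustration of problem (13.15), the Python sketch below codes the two log likelihoods just given and then maximizes the Poisson sample objective in the simplest case, where $\mathbf{x}_i$ contains only an intercept. In that special case the first-order condition is $N^{-1}\sum_i [y_i - \exp(\theta)] = 0$, so $\exp(\hat{\theta})$ equals the sample mean. All function names, the Newton iteration, and the data are assumptions made for this example rather than anything prescribed in the text.

```python
import math


def probit_loglik_i(y, x, theta):
    """Probit log likelihood for one observation (Example 13.1, continued)."""
    index = sum(xj * tj for xj, tj in zip(x, theta))
    p_one = 0.5 * (1.0 + math.erf(index / math.sqrt(2.0)))
    return y * math.log(p_one) + (1 - y) * math.log(1.0 - p_one)


def poisson_loglik_i(y, x, theta):
    """Poisson log likelihood for one observation, keeping the log(y!) term."""
    index = sum(xj * tj for xj, tj in zip(x, theta))
    return -math.exp(index) + y * index - math.lgamma(y + 1)


def average_loglik(loglik_i, data, theta):
    """Sample objective in (13.15); `data` is a list of (x, y) pairs."""
    return sum(loglik_i(y, x, theta) for x, y in data) / len(data)


def poisson_mle_intercept_only(ys, tol=1e-12, max_iter=500):
    """Newton's method for the intercept-only Poisson CMLE."""
    y_bar = sum(ys) / len(ys)
    if y_bar <= 0.0:
        raise ValueError("need at least one positive count")
    theta = 0.0
    for _ in range(max_iter):
        mu = math.exp(theta)
        step = (y_bar - mu) / mu  # minus (average score) / (average Hessian)
        theta += step
        if abs(step) < tol:
            break
    return theta


ys = [0, 1, 2, 3, 4, 5]             # hypothetical counts with sample mean 2.5
data = [([1.0], y) for y in ys]     # x_i is just the intercept
theta_hat = poisson_mle_intercept_only(ys)
```

Evaluating `average_loglik(poisson_loglik_i, data, [theta_hat])` and comparing it with values at nearby parameters shows that the objective is largest at `theta_hat`, where `math.exp(theta_hat)` reproduces the sample mean.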


13.4 Consistency of Conditional Maximum Likelihood Estimation 


In this section we state a formal consistency result for the CMLE, which is a special 
case of the M-estimator consistency result Theorem 12.2. 


THEOREM 13.1 (Consistency of CMLE): Let $\{(\mathbf{x}_i, \mathbf{y}_i) : i = 1, 2, \ldots\}$ be a
random sample with $\mathbf{x}_i \in \mathcal{X} \subset \mathbb{R}^K$, $\mathbf{y}_i \in \mathcal{Y} \subset \mathbb{R}^G$. Let
$\Theta \subset \mathbb{R}^P$ be the parameter set, and denote the parametric model of the
conditional density as $\{f(\cdot \mid \mathbf{x}; \boldsymbol{\theta}) : \mathbf{x} \in \mathcal{X}, \boldsymbol{\theta} \in \Theta\}$. Assume
that (a) $f(\cdot \mid \mathbf{x}; \boldsymbol{\theta})$ is a true density with respect to the measure
$\nu(d\mathbf{y})$ for all $\mathbf{x}$ and $\boldsymbol{\theta}$, so that condition (13.8) holds; (b) for some
$\boldsymbol{\theta}_o \in \Theta$, $p_o(\cdot \mid \mathbf{x}) = f(\cdot \mid \mathbf{x}; \boldsymbol{\theta}_o)$, all $\mathbf{x} \in \mathcal{X}$, and
$\boldsymbol{\theta}_o$ is the unique solution to problem (13.14); (c) $\Theta$ is a compact set;
(d) for each $\boldsymbol{\theta} \in \Theta$, $\ell(\cdot, \boldsymbol{\theta})$ is a Borel measurable function on
$\mathcal{Y} \times \mathcal{X}$; (e) for each $(\mathbf{y}, \mathbf{x}) \in \mathcal{Y} \times \mathcal{X}$, $\ell(\mathbf{y}, \mathbf{x}, \cdot)$ is a
continuous function on $\Theta$; and (f) $|\ell(\mathbf{w}, \boldsymbol{\theta})| \le b(\mathbf{w})$, all
$\boldsymbol{\theta} \in \Theta$, and $E[b(\mathbf{w})] < \infty$. Then there exists a solution to problem
(13.15), the CMLE $\hat{\boldsymbol{\theta}}$, and $\mathrm{plim}\, \hat{\boldsymbol{\theta}} = \boldsymbol{\theta}_o$.

As we discussed in Chapter 12, the measurability assumption in part d is purely
technical and does not need to be checked in practice. Compactness of $\Theta$ can be
relaxed, but doing so usually requires considerable work. The continuity assumption
holds in most econometric applications, but there are cases where it fails, such as
when estimating certain models of auctions; see, for example, Donald and Paarsch
(1996) and Paarsch and Hong (2006). The moment assumption in part f typically
restricts the distribution of $\mathbf{x}_i$ in some way, but such restrictions are rarely
a serious concern. For the most part, the key assumptions are that the parametric
model is correctly specified, that $\boldsymbol{\theta}_o$ is identified, and that the
log-likelihood function is continuous in $\boldsymbol{\theta}$.

For the probit and Poisson examples, the log likelihoods are clearly continuous in
$\boldsymbol{\theta}$. We can verify the moment condition (f) if we bound certain moments of $\mathbf{x}_i$ and
make the parameter space compact. But our primary concern is that densities are
correctly specified. For example, in the probit case, the density for $y_i$ given $\mathbf{x}_i$ will be
incorrect if the latent error $e_i$ is not independent of $\mathbf{x}_i$ and normally distributed, or if
the latent variable model is not linear to begin with. For identification we must rule
out perfect collinearity in $\mathbf{x}_i$. The Poisson CMLE turns out to have desirable
properties even if the Poisson distributional assumption does not hold, but we postpone a
discussion of the robustness of the Poisson CMLE until Section 13.11 and Chapter
18.


13.5 Asymptotic Normality and Asymptotic Variance Estimation 


Under the differentiability and moment assumptions that allow us to apply the the- 
orems in Chapter 12, we can show that the MLE is generally asymptotically normal. 
Naturally, the computational methods discussed in Section 12.7, including concen- 
trating parameters out of the log likelihood, apply directly. 


13.5.1 Asymptotic Normality 


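We can derive the limiting distribution of the MLE by applying Theorem 12.3. We
will have to assume the regularity conditions there; in particular, we assume that $\boldsymbol{\theta}_o$ is
in the interior of $\Theta$, and $\ell_i(\boldsymbol{\theta})$ is twice continuously differentiable on the interior of $\Theta$.
The score of the log likelihood for observation $i$ is simply

$$\mathbf{s}_i(\boldsymbol{\theta}) \equiv \nabla_{\boldsymbol{\theta}} \ell_i(\boldsymbol{\theta})' = \left( \frac{\partial \ell_i}{\partial \theta_1}(\boldsymbol{\theta}), \frac{\partial \ell_i}{\partial \theta_2}(\boldsymbol{\theta}), \ldots, \frac{\partial \ell_i}{\partial \theta_P}(\boldsymbol{\theta}) \right)', \tag{13.17}$$

a $P \times 1$ vector as in Chapter 12.

Example 13.1 (continued): For the probit case, $\boldsymbol{\theta}$ is $K \times 1$ and

$$\nabla_{\boldsymbol{\theta}} \ell_i(\boldsymbol{\theta}) = y_i \left\{ \frac{\phi(\mathbf{x}_i\boldsymbol{\theta})\mathbf{x}_i}{\Phi(\mathbf{x}_i\boldsymbol{\theta})} \right\} - (1 - y_i) \left\{ \frac{\phi(\mathbf{x}_i\boldsymbol{\theta})\mathbf{x}_i}{1 - \Phi(\mathbf{x}_i\boldsymbol{\theta})} \right\}.$$

Transposing this equation, and using a little algebra, gives

$$\mathbf{s}_i(\boldsymbol{\theta}) = \frac{\phi(\mathbf{x}_i\boldsymbol{\theta})\mathbf{x}_i'[y_i - \Phi(\mathbf{x}_i\boldsymbol{\theta})]}{\Phi(\mathbf{x}_i\boldsymbol{\theta})[1 - \Phi(\mathbf{x}_i\boldsymbol{\theta})]}. \tag{13.18}$$

Recall that $\mathbf{x}_i'$ is a $K \times 1$ vector.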

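Example 13.2 (continued): The score for the Poisson case, where $\boldsymbol{\theta}$ is again $K \times 1$, is

$$\mathbf{s}_i(\boldsymbol{\theta}) = -\exp(\mathbf{x}_i\boldsymbol{\theta})\mathbf{x}_i' + y_i\mathbf{x}_i' = \mathbf{x}_i'[y_i - \exp(\mathbf{x}_i\boldsymbol{\theta})]. \tag{13.19}$$

In the vast majority of cases, the score of the log-likelihood function has an
important zero conditional mean property:

$$\mathrm{E}[\mathbf{s}_i(\boldsymbol{\theta}_o) \mid \mathbf{x}_i] = \mathbf{0}. \tag{13.20}$$

In other words, when we evaluate the $P \times 1$ score at $\boldsymbol{\theta}_o$, and take its expectation with
respect to $f(\cdot \mid \mathbf{x}_i; \boldsymbol{\theta}_o)$, the expectation is zero. Under condition (13.20), $\mathrm{E}[\mathbf{s}_i(\boldsymbol{\theta}_o)] = \mathbf{0}$,
which was a key condition in deriving the asymptotic normality of the M-estimator
in Chapter 12.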

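To show condition (13.20) generally, let $\mathrm{E}_{\boldsymbol{\theta}}[\cdot \mid \mathbf{x}_i]$ denote conditional expectation
with respect to the density $f(\cdot \mid \mathbf{x}_i; \boldsymbol{\theta})$ for any $\boldsymbol{\theta} \in \Theta$. Then, by definition,

$$\mathrm{E}_{\boldsymbol{\theta}}[\mathbf{s}_i(\boldsymbol{\theta}) \mid \mathbf{x}_i] = \int_{\mathcal{Y}} \mathbf{s}(\mathbf{y}, \mathbf{x}_i, \boldsymbol{\theta}) f(\mathbf{y} \mid \mathbf{x}_i; \boldsymbol{\theta})\, \nu(d\mathbf{y}).$$

If integration and differentiation can be interchanged on $\operatorname{int}(\Theta)$, that is, if

$$\nabla_{\boldsymbol{\theta}} \left( \int_{\mathcal{Y}} f(\mathbf{y} \mid \mathbf{x}_i; \boldsymbol{\theta})\, \nu(d\mathbf{y}) \right) = \int_{\mathcal{Y}} \nabla_{\boldsymbol{\theta}} f(\mathbf{y} \mid \mathbf{x}_i; \boldsymbol{\theta})\, \nu(d\mathbf{y}) \tag{13.21}$$

for all $\mathbf{x}_i \in \mathcal{X}$, $\boldsymbol{\theta} \in \operatorname{int}(\Theta)$, then

$$\mathbf{0} = \int_{\mathcal{Y}} \nabla_{\boldsymbol{\theta}} f(\mathbf{y} \mid \mathbf{x}_i; \boldsymbol{\theta})\, \nu(d\mathbf{y}), \tag{13.22}$$

since $\int_{\mathcal{Y}} f(\mathbf{y} \mid \mathbf{x}_i; \boldsymbol{\theta})\, \nu(d\mathbf{y})$ is unity for all $\boldsymbol{\theta}$, and therefore the partial derivatives with
respect to $\boldsymbol{\theta}$ must be identically zero. But the right-hand side of equation (13.22) can
be written as $\int_{\mathcal{Y}} [\nabla_{\boldsymbol{\theta}} \ell(\mathbf{y}, \mathbf{x}_i, \boldsymbol{\theta})] f(\mathbf{y} \mid \mathbf{x}_i; \boldsymbol{\theta})\, \nu(d\mathbf{y})$. Putting in $\boldsymbol{\theta}_o$ for $\boldsymbol{\theta}$ and transposing
yields condition (13.20).

Example 13.1 (continued): Define $u_i \equiv y_i - \Phi(\mathbf{x}_i\boldsymbol{\theta}_o) = y_i - \mathrm{E}(y_i \mid \mathbf{x}_i)$. Then

$$\mathbf{s}_i(\boldsymbol{\theta}_o) = \frac{\phi(\mathbf{x}_i\boldsymbol{\theta}_o)\mathbf{x}_i' u_i}{\Phi(\mathbf{x}_i\boldsymbol{\theta}_o)[1 - \Phi(\mathbf{x}_i\boldsymbol{\theta}_o)]}$$

and, since $\mathrm{E}(u_i \mid \mathbf{x}_i) = 0$, it follows that $\mathrm{E}[\mathbf{s}_i(\boldsymbol{\theta}_o) \mid \mathbf{x}_i] = \mathbf{0}$.


Example 13.2 (continued): Define $u_i \equiv y_i - \exp(\mathbf{x}_i\boldsymbol{\theta}_o)$. Then $\mathbf{s}_i(\boldsymbol{\theta}_o) = \mathbf{x}_i' u_i$ and so
$\mathrm{E}[\mathbf{s}_i(\boldsymbol{\theta}_o) \mid \mathbf{x}_i] = \mathbf{0}$.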
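The zero conditional mean property (13.20) can be checked numerically for both running examples by summing the score against the model's own conditional distribution of $y_i$. The sketch below is illustrative only: the function names and the truncation point `y_max` for the Poisson sum are assumptions made for this example.

```python
import math
from typing import List, Sequence


def norm_pdf(z: float) -> float:
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)


def norm_cdf(z: float) -> float:
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def dot(x: Sequence[float], theta: Sequence[float]) -> float:
    return sum(xj * tj for xj, tj in zip(x, theta))


def probit_score(y: int, x: Sequence[float], theta: Sequence[float]) -> List[float]:
    """Probit score (13.18): phi * x' * (y - Phi) / [Phi * (1 - Phi)]."""
    index = dot(x, theta)
    p = norm_cdf(index)
    scale = norm_pdf(index) * (y - p) / (p * (1.0 - p))
    return [scale * xj for xj in x]


def poisson_score(y: int, x: Sequence[float], theta: Sequence[float]) -> List[float]:
    """Poisson score (13.19): x' * (y - exp(x theta))."""
    resid = y - math.exp(dot(x, theta))
    return [resid * xj for xj in x]


def expected_probit_score(x: Sequence[float], theta_o: Sequence[float]) -> List[float]:
    """E[s_i(theta_o) | x_i]: exact sum over y in {0, 1} with P(y = 1 | x) = Phi(x theta_o)."""
    p = norm_cdf(dot(x, theta_o))
    s1 = probit_score(1, x, theta_o)
    s0 = probit_score(0, x, theta_o)
    return [p * a + (1.0 - p) * b for a, b in zip(s1, s0)]


def expected_poisson_score(x: Sequence[float], theta_o: Sequence[float],
                           y_max: int = 200) -> List[float]:
    """E[s_i(theta_o) | x_i]: Poisson-weighted sum over y = 0, ..., y_max (truncated)."""
    index = dot(x, theta_o)
    mu = math.exp(index)
    total = [0.0] * len(x)
    for y in range(y_max + 1):
        prob = math.exp(-mu + y * index - math.lgamma(y + 1))  # Poisson pmf at y
        score = poisson_score(y, x, theta_o)
        total = [t + prob * s for t, s in zip(total, score)]
    return total
```

Both expectations should be zero up to rounding (and, for the Poisson case, up to the negligible truncated tail) for any choice of `x` and `theta_o`.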
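Assuming that $\ell_i(\boldsymbol{\theta})$ is twice continuously differentiable on the interior of $\Theta$, let
the Hessian for observation $i$ be the $P \times P$ matrix of second partial derivatives of
$\ell_i(\boldsymbol{\theta})$:

$$\mathbf{H}_i(\boldsymbol{\theta}) \equiv \nabla_{\boldsymbol{\theta}} \mathbf{s}_i(\boldsymbol{\theta}) = \nabla_{\boldsymbol{\theta}}^2 \ell_i(\boldsymbol{\theta}). \tag{13.23}$$

The Hessian is a symmetric matrix that generally depends on $(\mathbf{x}_i, \mathbf{y}_i)$. Since MLE is a
maximization problem, the expected value of $\mathbf{H}_i(\boldsymbol{\theta}_o)$ is negative definite. Thus, to
apply the theory in Chapter 12, we define

$$\mathbf{A}_o \equiv -\mathrm{E}[\mathbf{H}_i(\boldsymbol{\theta}_o)], \tag{13.24}$$

which is generally a positive definite matrix when $\boldsymbol{\theta}_o$ is identified. Under standard
regularity conditions, the asymptotic normality of the CMLE follows from Theorem
12.3: $\sqrt{N}(\hat{\boldsymbol{\theta}} - \boldsymbol{\theta}_o) \overset{a}{\sim} \mathrm{Normal}(\mathbf{0}, \mathbf{A}_o^{-1}\mathbf{B}_o\mathbf{A}_o^{-1})$, where $\mathbf{B}_o \equiv \mathrm{Var}[\mathbf{s}_i(\boldsymbol{\theta}_o)] = \mathrm{E}[\mathbf{s}_i(\boldsymbol{\theta}_o)\mathbf{s}_i(\boldsymbol{\theta}_o)']$.
It turns out that this general form of the asymptotic variance matrix is too
complicated. We now show that $\mathbf{B}_o = \mathbf{A}_o$.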

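We must assume enough smoothness such that the following interchange of
integral and derivative is valid (see Newey and McFadden, 1994, Sect. 5.1, for the case of
unconditional MLE):

$$\nabla_{\boldsymbol{\theta}} \left( \int_{\mathcal{Y}} \mathbf{s}_i(\boldsymbol{\theta}) f(\mathbf{y} \mid \mathbf{x}_i; \boldsymbol{\theta})\, \nu(d\mathbf{y}) \right) = \int_{\mathcal{Y}} \nabla_{\boldsymbol{\theta}} [\mathbf{s}_i(\boldsymbol{\theta}) f(\mathbf{y} \mid \mathbf{x}_i; \boldsymbol{\theta})]\, \nu(d\mathbf{y}). \tag{13.25}$$

Then, taking the derivative of the identity

$$\int_{\mathcal{Y}} \mathbf{s}_i(\boldsymbol{\theta}) f(\mathbf{y} \mid \mathbf{x}_i; \boldsymbol{\theta})\, \nu(d\mathbf{y}) \equiv \mathrm{E}_{\boldsymbol{\theta}}[\mathbf{s}_i(\boldsymbol{\theta}) \mid \mathbf{x}_i] = \mathbf{0}, \quad \boldsymbol{\theta} \in \operatorname{int}(\Theta),$$

and using equation (13.25), gives, for all $\boldsymbol{\theta} \in \operatorname{int}(\Theta)$,

$$-\mathrm{E}_{\boldsymbol{\theta}}[\mathbf{H}_i(\boldsymbol{\theta}) \mid \mathbf{x}_i] = \mathrm{Var}_{\boldsymbol{\theta}}[\mathbf{s}_i(\boldsymbol{\theta}) \mid \mathbf{x}_i],$$

where the indexing by $\boldsymbol{\theta}$ denotes expectation and variance when $f(\cdot \mid \mathbf{x}_i; \boldsymbol{\theta})$ is the
density of $\mathbf{y}_i$ given $\mathbf{x}_i$. When evaluated at $\boldsymbol{\theta} = \boldsymbol{\theta}_o$ we get a very important equality:

$$-\mathrm{E}[\mathbf{H}_i(\boldsymbol{\theta}_o) \mid \mathbf{x}_i] = \mathrm{E}[\mathbf{s}_i(\boldsymbol{\theta}_o)\mathbf{s}_i(\boldsymbol{\theta}_o)' \mid \mathbf{x}_i], \tag{13.26}$$

where the expectation and variance are with respect to the true conditional
distribution of $\mathbf{y}_i$ given $\mathbf{x}_i$. Equation (13.26) is called the conditional information matrix
equality (CIME). Taking the expectation of equation (13.26) (with respect to the
distribution of $\mathbf{x}_i$) and using the law of iterated expectations gives

$$-\mathrm{E}[\mathbf{H}_i(\boldsymbol{\theta}_o)] = \mathrm{E}[\mathbf{s}_i(\boldsymbol{\theta}_o)\mathbf{s}_i(\boldsymbol{\theta}_o)'], \tag{13.27}$$

or $\mathbf{A}_o = \mathbf{B}_o$. This relationship is best thought of as the unconditional information
matrix equality (UIME).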
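As a rough numerical illustration of the conditional information matrix equality (13.26), the sketch below compares the two sides for the probit model at a fixed $\mathbf{x}_i$. It is a sketch under stated assumptions: the Hessian is obtained by central differences of the score (13.18) rather than analytically, and all function names and the step size are choices made for this example.

```python
import math
from typing import List, Sequence, Tuple


def norm_pdf(z: float) -> float:
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)


def norm_cdf(z: float) -> float:
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def dot(x: Sequence[float], theta: Sequence[float]) -> float:
    return sum(xj * tj for xj, tj in zip(x, theta))


def probit_score(y: int, x: Sequence[float], theta: Sequence[float]) -> List[float]:
    """Probit score (13.18)."""
    index = dot(x, theta)
    p = norm_cdf(index)
    scale = norm_pdf(index) * (y - p) / (p * (1.0 - p))
    return [scale * xj for xj in x]


def numerical_hessian(y: int, x: Sequence[float], theta: Sequence[float],
                      step: float = 1e-5) -> List[List[float]]:
    """Central-difference Jacobian of the score: entry [r][c] is d s_r / d theta_c."""
    k = len(theta)
    hess = [[0.0] * k for _ in range(k)]
    for c in range(k):
        up, down = list(theta), list(theta)
        up[c] += step
        down[c] -= step
        s_up = probit_score(y, x, up)
        s_down = probit_score(y, x, down)
        for r in range(k):
            hess[r][c] = (s_up[r] - s_down[r]) / (2.0 * step)
    return hess


def cime_sides(x: Sequence[float],
               theta_o: Sequence[float]) -> Tuple[List[List[float]], List[List[float]]]:
    """Return (-E[H_i | x_i], E[s_i s_i' | x_i]) at theta_o, summing over y in {0, 1}."""
    p1 = norm_cdf(dot(x, theta_o))
    k = len(theta_o)
    minus_expected_hessian = [[0.0] * k for _ in range(k)]
    expected_outer_product = [[0.0] * k for _ in range(k)]
    for y, prob in ((1, p1), (0, 1.0 - p1)):
        s = probit_score(y, x, theta_o)
        h = numerical_hessian(y, x, theta_o)
        for r in range(k):
            for c in range(k):
                minus_expected_hessian[r][c] -= prob * h[r][c]
                expected_outer_product[r][c] += prob * s[r] * s[c]
    return minus_expected_hessian, expected_outer_product
```

The two returned matrices should agree up to finite-difference error, which is what (13.26) asserts when the probit density is the true one.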


THEOREM 13.2 (Asymptotic Normality of CMLE): Let the conditions of Theorem
13.1 hold. In addition, assume that (a) $\boldsymbol{\theta}_o \in \operatorname{int}(\Theta)$; (b) for each $(\mathbf{y}, \mathbf{x}) \in \mathcal{Y} \times \mathcal{X}$,
$\ell(\mathbf{y}, \mathbf{x}, \cdot)$ is twice continuously differentiable on $\operatorname{int}(\Theta)$; (c) the interchanges of
derivative and integral in equations (13.21) and (13.25) hold for all $\boldsymbol{\theta} \in \operatorname{int}(\Theta)$; (d)
the elements of $\nabla_{\boldsymbol{\theta}}^2 \ell(\mathbf{y}, \mathbf{x}, \boldsymbol{\theta})$ are bounded in absolute value by a function $b(\mathbf{y}, \mathbf{x})$ with
finite expectation; and (e) $\mathbf{A}_o$ defined by expression (13.24) is positive definite. Then

$$\sqrt{N}(\hat{\boldsymbol{\theta}} - \boldsymbol{\theta}_o) \overset{d}{\to} \mathrm{Normal}(\mathbf{0}, \mathbf{A}_o^{-1}) \tag{13.28}$$

and therefore

$$\mathrm{Avar}(\hat{\boldsymbol{\theta}}) = \mathbf{A}_o^{-1}/N. \tag{13.29}$$


In standard applications, the log likelihood has many continuous partial
derivatives, although there are examples where it does not. Some examples also violate the
interchange of the integral and derivative in equation (13.21) or (13.25), such as when
the conditional support of $\mathbf{y}_i$ depends on the parameters $\boldsymbol{\theta}_o$. In such cases we cannot
expect the CMLE to have a limiting normal distribution; it may not even converge
at the rate $\sqrt{N}$. Some progress has been made for specific models when the support
of the distribution depends on unknown parameters; see, for example, Donald and
Paarsch (1996).


13.5.2 Estimating the Asymptotic Variance 


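Estimating $\mathrm{Avar}(\hat{\boldsymbol{\theta}})$ requires estimating $\mathbf{A}_o$. From the equalities derived previously,
there are at least three possible estimators of $\mathbf{A}_o$ in the CMLE context. In fact, under
slight extensions of the regularity conditions in Theorem 13.2, each of the matrices

$$N^{-1} \sum_{i=1}^{N} -\mathbf{H}_i(\hat{\boldsymbol{\theta}}), \quad N^{-1} \sum_{i=1}^{N} \mathbf{s}_i(\hat{\boldsymbol{\theta}})\mathbf{s}_i(\hat{\boldsymbol{\theta}})', \quad \text{and} \quad N^{-1} \sum_{i=1}^{N} \mathbf{A}(\mathbf{x}_i, \hat{\boldsymbol{\theta}}) \tag{13.30}$$

converges to $\mathbf{A}_o = \mathbf{B}_o$, where

$$\mathbf{A}(\mathbf{x}_i, \boldsymbol{\theta}_o) \equiv -\mathrm{E}[\mathbf{H}(\mathbf{y}_i, \mathbf{x}_i, \boldsymbol{\theta}_o) \mid \mathbf{x}_i]. \tag{13.31}$$

Thus, $\widehat{\mathrm{Avar}}(\hat{\boldsymbol{\theta}})$ can be taken to be any of the three matrices

$$\left[ -\sum_{i=1}^{N} \mathbf{H}_i(\hat{\boldsymbol{\theta}}) \right]^{-1}, \quad \left[ \sum_{i=1}^{N} \mathbf{s}_i(\hat{\boldsymbol{\theta}})\mathbf{s}_i(\hat{\boldsymbol{\theta}})' \right]^{-1}, \quad \text{or} \quad \left[ \sum_{i=1}^{N} \mathbf{A}(\mathbf{x}_i, \hat{\boldsymbol{\theta}}) \right]^{-1} \tag{13.32}$$

and the asymptotic standard errors are the square roots of the diagonal elements of
any of the matrices. We discussed each of these estimators in the general M-estimator
case in Chapter 12, but a brief review is in order. The first estimator, based on the
Hessian of the log likelihood, requires computing second derivatives. When the inverse
exists, the estimate is positive definite because $\hat{\boldsymbol{\theta}}$ maximizes the objective function.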

The second estimator in equation (13.32), based on the outer product of the score,
depends only on first derivatives of the log-likelihood function. This simple estimator
was proposed by Berndt, Hall, Hall, and Hausman (1974). Its primary drawback is
that it can be poorly behaved in even moderate sample sizes, as we discussed in
Section 12.6.2.

If the conditional expectation $\mathbf{A}(\mathbf{x}_i, \boldsymbol{\theta}_o)$ is in closed form (as it is in some leading
cases) or can be simulated, as discussed in Porter (2002), then the estimator based
on $\mathbf{A}(\mathbf{x}_i, \hat{\boldsymbol{\theta}})$ has some attractive features. First, it often depends only on first
derivatives of a conditional mean or conditional variance function. Second, it is also
positive definite when it exists because of the conditional information matrix equality
(13.26). Third, this estimator has been found to have significantly better finite-sample
properties than the outer product of the score estimator in some situations where
$\mathbf{A}(\mathbf{x}_i, \boldsymbol{\theta}_o)$ can be obtained in closed form.


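Example 13.1 (continued): The Hessian for the probit log likelihood is a mess.
Fortunately, $\mathrm{E}[\mathbf{H}_i(\boldsymbol{\theta}_o) \mid \mathbf{x}_i]$ has a fairly simple form. Taking the derivative of equation
(13.18) and using the product rule gives

$$\mathbf{H}_i(\boldsymbol{\theta}) = -\frac{\{\phi(\mathbf{x}_i\boldsymbol{\theta})\}^2 \mathbf{x}_i'\mathbf{x}_i}{\Phi(\mathbf{x}_i\boldsymbol{\theta})[1 - \Phi(\mathbf{x}_i\boldsymbol{\theta})]} + [y_i - \Phi(\mathbf{x}_i\boldsymbol{\theta})]\mathbf{L}(\mathbf{x}_i\boldsymbol{\theta}),$$

where $\mathbf{L}(\mathbf{x}_i\boldsymbol{\theta})$ is a $K \times K$ complicated function of $\mathbf{x}_i\boldsymbol{\theta}$ that we need not find explicitly.
Now, when we evaluate this expression at $\boldsymbol{\theta}_o$ and note that $\mathrm{E}\{[y_i - \Phi(\mathbf{x}_i\boldsymbol{\theta}_o)]\mathbf{L}(\mathbf{x}_i\boldsymbol{\theta}_o) \mid \mathbf{x}_i\}
= [\mathrm{E}(y_i \mid \mathbf{x}_i) - \Phi(\mathbf{x}_i\boldsymbol{\theta}_o)]\mathbf{L}(\mathbf{x}_i\boldsymbol{\theta}_o) = \mathbf{0}$, we have

$$-\mathrm{E}[\mathbf{H}_i(\boldsymbol{\theta}_o) \mid \mathbf{x}_i] = \mathbf{A}_i(\boldsymbol{\theta}_o) = \frac{\{\phi(\mathbf{x}_i\boldsymbol{\theta}_o)\}^2 \mathbf{x}_i'\mathbf{x}_i}{\Phi(\mathbf{x}_i\boldsymbol{\theta}_o)[1 - \Phi(\mathbf{x}_i\boldsymbol{\theta}_o)]}.$$

Thus, $\widehat{\mathrm{Avar}}(\hat{\boldsymbol{\theta}})$ in probit analysis is

$$\left( \sum_{i=1}^{N} \frac{\{\phi(\mathbf{x}_i\hat{\boldsymbol{\theta}})\}^2 \mathbf{x}_i'\mathbf{x}_i}{\Phi(\mathbf{x}_i\hat{\boldsymbol{\theta}})[1 - \Phi(\mathbf{x}_i\hat{\boldsymbol{\theta}})]} \right)^{-1} \tag{13.33}$$

which is always positive definite when the inverse exists. Note that $\mathbf{x}_i'\mathbf{x}_i$ is a $K \times K$
matrix for each $i$.


Example 13.2 (continued): For the Poisson model with exponential conditional
mean, $\mathbf{H}_i(\boldsymbol{\theta}) = -\exp(\mathbf{x}_i\boldsymbol{\theta})\mathbf{x}_i'\mathbf{x}_i$. In this example, the Hessian does not depend on $y_i$,
so there is no distinction between $\mathbf{H}_i(\boldsymbol{\theta}_o)$ and $\mathrm{E}[\mathbf{H}_i(\boldsymbol{\theta}_o) \mid \mathbf{x}_i]$. The positive definite
estimate of $\mathrm{Avar}(\hat{\boldsymbol{\theta}})$ is simply

$$\left[ \sum_{i=1}^{N} \exp(\mathbf{x}_i\hat{\boldsymbol{\theta}})\mathbf{x}_i'\mathbf{x}_i \right]^{-1}. \tag{13.34}$$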
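For the Poisson example, the variance estimators in (13.32) can be sketched with NumPy as follows. Because the Poisson Hessian does not depend on $y_i$, the Hessian and expected Hessian forms coincide, so only two distinct matrices are returned. This is a minimal sketch under our own assumptions: the function names, the Newton-Raphson settings, and the requirement that the first column of `X` be a constant (used only for a starting value) are not from the text.

```python
import numpy as np


def poisson_mle(y, X, max_iter=50, tol=1e-12):
    """Newton-Raphson for the Poisson log likelihood with mean exp(x theta).

    Assumes the first column of X is a constant; it is used for the starting value.
    """
    theta = np.zeros(X.shape[1])
    theta[0] = np.log(np.mean(y))
    for _ in range(max_iter):
        mu = np.exp(X @ theta)
        score = X.T @ (y - mu)                # sum over i of the scores (13.19)
        hessian = -(X * mu[:, None]).T @ X    # sum over i of H_i(theta)
        step = np.linalg.solve(hessian, score)
        theta = theta - step
        if np.max(np.abs(step)) < tol:
            break
    return theta


def avar_estimates(y, X, theta_hat):
    """Return (Hessian-based, outer-product-based) estimates of Avar(theta_hat)."""
    mu = np.exp(X @ theta_hat)
    scores = X * (y - mu)[:, None]                           # row i is s_i(theta_hat)'
    hessian_based = np.linalg.inv((X * mu[:, None]).T @ X)   # equation (13.34)
    outer_product = np.linalg.inv(scores.T @ scores)         # second matrix in (13.32)
    return hessian_based, outer_product
```

Asymptotic standard errors are the square roots of the diagonals, for example `np.sqrt(np.diag(hessian_based))`. When the Poisson density is correctly specified the two sets of standard errors should be close in large samples.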
13.6 Hypothesis Testing 


Given the asymptotic standard errors, it is easy to form asymptotic $t$ statistics for
testing single hypotheses. These $t$ statistics are asymptotically distributed as standard
normal.

The three tests covered in Chapter 12 are immediately applicable to the MLE case.
Since the information matrix equality holds when the density is correctly specified, we
need only consider the simplest forms of the test statistics. The Wald statistic is given
in equation (12.63), and the conditions sufficient for it to have a limiting chi-square
distribution are discussed in Section 12.6.1.

Define the log-likelihood function for the entire sample by $\mathcal{L}(\boldsymbol{\theta}) \equiv \sum_{i=1}^{N} \ell_i(\boldsymbol{\theta})$. Let
$\hat{\boldsymbol{\theta}}$ be the unrestricted estimator, and let $\tilde{\boldsymbol{\theta}}$ be the estimator with the $Q$ nonredundant
constraints imposed. Then, under the regularity conditions discussed in Section
12.6.3, the likelihood ratio (LR) statistic,

$$LR \equiv 2[\mathcal{L}(\hat{\boldsymbol{\theta}}) - \mathcal{L}(\tilde{\boldsymbol{\theta}})], \tag{13.35}$$

is distributed asymptotically as $\chi_Q^2$ under $\mathrm{H}_0$. As with the Wald statistic, we cannot
use LR as approximately $\chi_Q^2$ when $\boldsymbol{\theta}_o$ is on the boundary of the parameter set. The
LR statistic is very easy to compute once the restricted and unrestricted models have
been estimated, and the LR statistic is invariant to reparameterizing the conditional
density.

The score or LM test is based on the restricted estimation only. Let $\mathbf{s}_i(\tilde{\boldsymbol{\theta}})$ be the
$P \times 1$ score of $\ell_i(\boldsymbol{\theta})$ evaluated at the restricted estimates $\tilde{\boldsymbol{\theta}}$. That is, we compute the
partial derivatives of $\ell_i(\boldsymbol{\theta})$ with respect to each of the $P$ parameters, but then we
evaluate this vector of partials at the restricted estimates. Then, from Section 12.6.2
and the information matrix equality, the statistics

$$\left( \sum_{i=1}^{N} \tilde{\mathbf{s}}_i \right)' \left( -\sum_{i=1}^{N} \tilde{\mathbf{H}}_i \right)^{-1} \left( \sum_{i=1}^{N} \tilde{\mathbf{s}}_i \right), \quad \left( \sum_{i=1}^{N} \tilde{\mathbf{s}}_i \right)' \left( \sum_{i=1}^{N} \tilde{\mathbf{A}}_i \right)^{-1} \left( \sum_{i=1}^{N} \tilde{\mathbf{s}}_i \right), \quad \text{and} \quad \left( \sum_{i=1}^{N} \tilde{\mathbf{s}}_i \right)' \left( \sum_{i=1}^{N} \tilde{\mathbf{s}}_i\tilde{\mathbf{s}}_i' \right)^{-1} \left( \sum_{i=1}^{N} \tilde{\mathbf{s}}_i \right) \tag{13.36}$$

have limiting $\chi_Q^2$ distributions under $\mathrm{H}_0$. As we know from Section 12.6.2, the first
statistic is not guaranteed to be nonnegative (because the matrix in the middle is not
necessarily positive definite) and is not invariant to reparameterizations, but the outer
product statistic is. In addition, using the conditional information matrix equality, it
can be shown that the LM statistic based on $\tilde{\mathbf{A}}_i$ is invariant to reparameterization.
Davidson and MacKinnon (1993, Sect. 13.6) show invariance in the case of
unconditional maximum likelihood. Invariance holds in the more general conditional ML
setup, with $\mathbf{x}_i$ containing any conditioning variables (see Problem 13.5). We have
already used the expected Hessian form of the LM statistic for nonlinear regression in
Section 12.6.2. We will use it in several applications in Part IV, including binary
response models and Poisson regression models. In these examples, the statistic can be
computed conveniently using auxiliary regressions based on weighted residuals.

Because the unconditional information matrix equality holds, we know from Sec- 
tion 12.6.4 that the three classical statistics have the same limiting distribution under 
local alternatives. Therefore, either small-sample considerations, invariance, or com- 
putational issues must be used to choose among the statistics. 
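As an illustration of how little is needed to compute the LR statistic (13.35) once both models are estimated, here is a minimal NumPy sketch for Poisson regression with exclusion restrictions. The setup is an assumption made for this example: the restricted model simply drops columns of the regressor matrix, the first column of each matrix is a constant, and the function names are ours.

```python
import numpy as np


def poisson_loglik(theta, y, X):
    """Sample log likelihood, omitting the -log(y_i!) terms that do not involve theta."""
    index = X @ theta
    return float(np.sum(-np.exp(index) + y * index))


def poisson_mle(y, X, max_iter=50, tol=1e-12):
    """Newton-Raphson; assumes the first column of X is a constant (starting value)."""
    theta = np.zeros(X.shape[1])
    theta[0] = np.log(np.mean(y))
    for _ in range(max_iter):
        mu = np.exp(X @ theta)
        score = X.T @ (y - mu)
        hessian = -(X * mu[:, None]).T @ X
        step = np.linalg.solve(hessian, score)
        theta = theta - step
        if np.max(np.abs(step)) < tol:
            break
    return theta


def lr_statistic(y, X_unrestricted, X_restricted):
    """LR = 2 [L(theta_hat) - L(theta_tilde)] for nested Poisson regressions."""
    theta_hat = poisson_mle(y, X_unrestricted)
    theta_tilde = poisson_mle(y, X_restricted)
    return 2.0 * (poisson_loglik(theta_hat, y, X_unrestricted)
                  - poisson_loglik(theta_tilde, y, X_restricted))
```

The omitted $\log(y_i!)$ terms cancel in the difference, so dropping them does not change LR. The statistic would be compared with a $\chi_Q^2$ critical value, where $Q$ is the number of excluded columns.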


13.7 Specification Testing 


Because MLE generally relies on its distributional assumptions, it is useful to have 
available a general class of specification tests that are simple to compute. One general 
approach is to nest the model of interest within a more general model (which may be 
much harder to estimate) and obtain the score test against the more general alternative. 
RESET in a linear model and its extension to exponential regression models in Section 
12.6.2 are examples of this approach, albeit in a non-maximum-likelihood setting.

In the context of MLE, it makes sense to test moment conditions implied by the
conditional density specification. Let $\mathbf{w}_i = (\mathbf{x}_i, \mathbf{y}_i)$ and suppose that, when $f(\cdot \mid \mathbf{x}; \boldsymbol{\theta})$ is
correctly specified,

$$\mathrm{H}_0 : \mathrm{E}[\mathbf{g}(\mathbf{w}_i, \boldsymbol{\theta}_o)] = \mathbf{0}, \tag{13.37}$$

where $\mathbf{g}(\mathbf{w}, \boldsymbol{\theta})$ is a $Q \times 1$ vector. Any application implies innumerable choices for the
function $\mathbf{g}$. Since the MLE $\hat{\boldsymbol{\theta}}$ sets the sum of the score to zero, $\mathbf{g}(\mathbf{w}, \boldsymbol{\theta})$ cannot contain
elements of $\mathbf{s}(\mathbf{w}, \boldsymbol{\theta})$. Generally, $\mathbf{g}$ should be chosen to test features of a model that are
of primary interest, such as first and second conditional moments, or various
conditional probabilities.

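A test of hypothesis (13.37) is based on how far the sample average of $\mathbf{g}(\mathbf{w}_i, \hat{\boldsymbol{\theta}})$ is
from zero. To derive the asymptotic distribution, note that

$$N^{-1/2} \sum_{i=1}^{N} \mathbf{g}_i(\hat{\boldsymbol{\theta}}) = N^{-1/2} \sum_{i=1}^{N} [\mathbf{g}_i(\hat{\boldsymbol{\theta}}) - \boldsymbol{\Pi}_o'\mathbf{s}_i(\hat{\boldsymbol{\theta}})]$$

holds trivially because $\sum_{i=1}^{N} \mathbf{s}_i(\hat{\boldsymbol{\theta}}) = \mathbf{0}$, where

$$\boldsymbol{\Pi}_o \equiv \{\mathrm{E}[\mathbf{s}_i(\boldsymbol{\theta}_o)\mathbf{s}_i(\boldsymbol{\theta}_o)']\}^{-1} \{\mathrm{E}[\mathbf{s}_i(\boldsymbol{\theta}_o)\mathbf{g}_i(\boldsymbol{\theta}_o)']\}$$

is the $P \times Q$ matrix of population regression coefficients from regressing $\mathbf{g}_i(\boldsymbol{\theta}_o)'$ on
$\mathbf{s}_i(\boldsymbol{\theta}_o)'$. Using a mean-value expansion about $\boldsymbol{\theta}_o$ and algebra similar to that in
Chapter 12, we can write

$$N^{-1/2} \sum_{i=1}^{N} [\mathbf{g}_i(\hat{\boldsymbol{\theta}}) - \boldsymbol{\Pi}_o'\mathbf{s}_i(\hat{\boldsymbol{\theta}})] = N^{-1/2} \sum_{i=1}^{N} [\mathbf{g}_i(\boldsymbol{\theta}_o) - \boldsymbol{\Pi}_o'\mathbf{s}_i(\boldsymbol{\theta}_o)] + \mathrm{E}[\nabla_{\boldsymbol{\theta}}\mathbf{g}_i(\boldsymbol{\theta}_o) - \boldsymbol{\Pi}_o'\nabla_{\boldsymbol{\theta}}\mathbf{s}_i(\boldsymbol{\theta}_o)] \sqrt{N}(\hat{\boldsymbol{\theta}} - \boldsymbol{\theta}_o) + o_p(1). \tag{13.38}$$

The key is that, when the density is correctly specified, the second term on the
right-hand side of equation (13.38) is identically zero. Here is the reason: First, equation
(13.27) implies that $[\mathrm{E}\nabla_{\boldsymbol{\theta}}\mathbf{s}_i(\boldsymbol{\theta}_o)]\{\mathrm{E}[\mathbf{s}_i(\boldsymbol{\theta}_o)\mathbf{s}_i(\boldsymbol{\theta}_o)']\}^{-1} = -\mathbf{I}_P$. Second, an extension of the
conditional information matrix equality (Newey, 1985; Tauchen, 1985) implies that

$$-\mathrm{E}[\nabla_{\boldsymbol{\theta}}\mathbf{g}_i(\boldsymbol{\theta}_o) \mid \mathbf{x}_i] = \mathrm{E}[\mathbf{g}_i(\boldsymbol{\theta}_o)\mathbf{s}_i(\boldsymbol{\theta}_o)' \mid \mathbf{x}_i]. \tag{13.39}$$

To show equation (13.39), write

$$\mathrm{E}_{\boldsymbol{\theta}}[\mathbf{g}_i(\boldsymbol{\theta}) \mid \mathbf{x}_i] = \int_{\mathcal{Y}} \mathbf{g}(\mathbf{y}, \mathbf{x}_i, \boldsymbol{\theta}) f(\mathbf{y} \mid \mathbf{x}_i; \boldsymbol{\theta})\, \nu(d\mathbf{y}) = \mathbf{0} \tag{13.40}$$

for all $\boldsymbol{\theta}$. Now, if we take the derivative with respect to $\boldsymbol{\theta}$ and assume that the
integrals and derivative can be interchanged, equation (13.40) implies that

$$\int_{\mathcal{Y}} \nabla_{\boldsymbol{\theta}}\mathbf{g}(\mathbf{y}, \mathbf{x}_i, \boldsymbol{\theta}) f(\mathbf{y} \mid \mathbf{x}_i; \boldsymbol{\theta})\, \nu(d\mathbf{y}) + \int_{\mathcal{Y}} \mathbf{g}(\mathbf{y}, \mathbf{x}_i, \boldsymbol{\theta}) \nabla_{\boldsymbol{\theta}} f(\mathbf{y} \mid \mathbf{x}_i; \boldsymbol{\theta})\, \nu(d\mathbf{y}) = \mathbf{0}$$

or $\mathrm{E}_{\boldsymbol{\theta}}[\nabla_{\boldsymbol{\theta}}\mathbf{g}_i(\boldsymbol{\theta}) \mid \mathbf{x}_i] + \mathrm{E}_{\boldsymbol{\theta}}[\mathbf{g}_i(\boldsymbol{\theta})\mathbf{s}_i(\boldsymbol{\theta})' \mid \mathbf{x}_i] = \mathbf{0}$, where we use the fact that $\nabla_{\boldsymbol{\theta}} f(\mathbf{y} \mid \mathbf{x}; \boldsymbol{\theta}) =
\mathbf{s}(\mathbf{y}, \mathbf{x}, \boldsymbol{\theta})' f(\mathbf{y} \mid \mathbf{x}; \boldsymbol{\theta})$. Plugging in $\boldsymbol{\theta} = \boldsymbol{\theta}_o$ and rearranging gives equation (13.39).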
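What we have shown is that

$$N^{-1/2} \sum_{i=1}^{N} [\mathbf{g}_i(\hat{\boldsymbol{\theta}}) - \boldsymbol{\Pi}_o'\mathbf{s}_i(\hat{\boldsymbol{\theta}})] = N^{-1/2} \sum_{i=1}^{N} [\mathbf{g}_i(\boldsymbol{\theta}_o) - \boldsymbol{\Pi}_o'\mathbf{s}_i(\boldsymbol{\theta}_o)] + o_p(1),$$

which means these standardized partial sums have the same asymptotic distribution.
Letting

$$\hat{\boldsymbol{\Pi}} \equiv \left( \sum_{i=1}^{N} \hat{\mathbf{s}}_i\hat{\mathbf{s}}_i' \right)^{-1} \left( \sum_{i=1}^{N} \hat{\mathbf{s}}_i\hat{\mathbf{g}}_i' \right),$$

it is easily seen that $\operatorname{plim} \hat{\boldsymbol{\Pi}} = \boldsymbol{\Pi}_o$ under standard regularity conditions. Therefore,
the asymptotic variance of $N^{-1/2} \sum_{i=1}^{N} [\mathbf{g}_i(\hat{\boldsymbol{\theta}}) - \boldsymbol{\Pi}_o'\mathbf{s}_i(\hat{\boldsymbol{\theta}})] = N^{-1/2} \sum_{i=1}^{N} \mathbf{g}_i(\hat{\boldsymbol{\theta}})$ is
consistently estimated by $N^{-1} \sum_{i=1}^{N} (\hat{\mathbf{g}}_i - \hat{\boldsymbol{\Pi}}'\hat{\mathbf{s}}_i)(\hat{\mathbf{g}}_i - \hat{\boldsymbol{\Pi}}'\hat{\mathbf{s}}_i)'$. When we construct the
quadratic form, we get the Newey-Tauchen-White (NTW) statistic,

$$NTW = \left[ \sum_{i=1}^{N} \mathbf{g}_i(\hat{\boldsymbol{\theta}}) \right]' \left[ \sum_{i=1}^{N} (\hat{\mathbf{g}}_i - \hat{\boldsymbol{\Pi}}'\hat{\mathbf{s}}_i)(\hat{\mathbf{g}}_i - \hat{\boldsymbol{\Pi}}'\hat{\mathbf{s}}_i)' \right]^{-1} \left[ \sum_{i=1}^{N} \mathbf{g}_i(\hat{\boldsymbol{\theta}}) \right]. \tag{13.41}$$

This statistic was proposed independently by Newey (1985) and Tauchen (1985), and
is an extension of White's (1982a) information matrix (IM) test statistic.

For computational purposes it is useful to note that equation (13.41) is identical to
$N - \mathrm{SSR}_0 = N R_0^2$ from the regression

$$1 \text{ on } \hat{\mathbf{s}}_i', \hat{\mathbf{g}}_i', \quad i = 1, 2, \ldots, N, \tag{13.42}$$

where $\mathrm{SSR}_0$ is the usual sum of squared residuals. Under the null that the density is
correctly specified, NTW is distributed asymptotically as $\chi_Q^2$, assuming that $\mathbf{g}(\mathbf{w}, \boldsymbol{\theta})$
contains $Q$ nonredundant moment conditions. Unfortunately, the outer product form
of regression (13.42) means that the statistic can have poor finite-sample properties.
In particular applications, such as nonlinear least squares, binary response analysis,
and Poisson regression, to name a few, it is best to use forms of test statistics based
on the expected Hessian. We gave the regression-based test for NLS in equation
(12.72), and we will see other examples in later chapters. For the IM test statistic,
Davidson and MacKinnon (1992) have suggested an alternative form of the IM
statistic that appears to have better finite-sample properties.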


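Example 13.2 (continued): To test the specification of the conditional mean for
Poisson regression, we might take $\mathbf{g}(\mathbf{w}, \boldsymbol{\theta}) = \exp(\mathbf{x}\boldsymbol{\theta})\mathbf{x}'[y - \exp(\mathbf{x}\boldsymbol{\theta})] = \exp(\mathbf{x}\boldsymbol{\theta})\mathbf{s}(\mathbf{w}, \boldsymbol{\theta})$,
where the score is given by equation (13.19). If $\mathrm{E}(y \mid \mathbf{x}) = \exp(\mathbf{x}\boldsymbol{\theta}_o)$ then $\mathrm{E}[\mathbf{g}(\mathbf{w}, \boldsymbol{\theta}_o) \mid \mathbf{x}]
= \exp(\mathbf{x}\boldsymbol{\theta}_o)\mathrm{E}[\mathbf{s}(\mathbf{w}, \boldsymbol{\theta}_o) \mid \mathbf{x}] = \mathbf{0}$. To test the Poisson variance assumption, $\mathrm{Var}(y \mid \mathbf{x}) =
\mathrm{E}(y \mid \mathbf{x}) = \exp(\mathbf{x}\boldsymbol{\theta}_o)$, $\mathbf{g}$ can be of the form $\mathbf{g}(\mathbf{w}, \boldsymbol{\theta}) = \mathbf{a}(\mathbf{x}, \boldsymbol{\theta})\{[y - \exp(\mathbf{x}\boldsymbol{\theta})]^2 - \exp(\mathbf{x}\boldsymbol{\theta})\}$,
where $\mathbf{a}(\mathbf{x}, \boldsymbol{\theta})$ is a $Q \times 1$ vector. If the Poisson assumption is true, then $u = y -
\exp(\mathbf{x}\boldsymbol{\theta}_o)$ has a zero conditional mean and $\mathrm{E}(u^2 \mid \mathbf{x}) = \mathrm{Var}(y \mid \mathbf{x}) = \exp(\mathbf{x}\boldsymbol{\theta}_o)$. It
follows that $\mathrm{E}[\mathbf{g}(\mathbf{w}, \boldsymbol{\theta}_o) \mid \mathbf{x}] = \mathbf{0}$.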
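The equivalence between the quadratic form (13.41) and $N - \mathrm{SSR}_0$ from regression (13.42) can be illustrated for the Poisson variance test with the simplest choice $a(\mathbf{x}, \boldsymbol{\theta}) = 1$, so that $Q = 1$. The following NumPy sketch rests on assumptions of ours: the function names, the Newton-Raphson routine, and the choice of $a$ are illustrative, and the first column of `X` is taken to be a constant.

```python
import numpy as np


def poisson_mle(y, X, max_iter=50, tol=1e-12):
    """Newton-Raphson; assumes the first column of X is a constant (starting value)."""
    theta = np.zeros(X.shape[1])
    theta[0] = np.log(np.mean(y))
    for _ in range(max_iter):
        mu = np.exp(X @ theta)
        step = np.linalg.solve(-(X * mu[:, None]).T @ X, X.T @ (y - mu))
        theta = theta - step
        if np.max(np.abs(step)) < tol:
            break
    return theta


def scores_and_variance_moment(y, X, theta_hat):
    """Rows are s_i(theta_hat)' and g_i(theta_hat)' with g = (y - mu)^2 - mu."""
    mu = np.exp(X @ theta_hat)
    scores = X * (y - mu)[:, None]             # N x P
    moments = ((y - mu) ** 2 - mu)[:, None]    # N x 1
    return scores, moments


def ntw_statistic(scores, moments):
    """Quadratic form (13.41)."""
    pi_hat = np.linalg.solve(scores.T @ scores, scores.T @ moments)  # P x Q
    resid = moments - scores @ pi_hat      # row i is (g_i - Pi_hat' s_i)'
    total = moments.sum(axis=0)            # sum over i of g_i(theta_hat)
    return float(total @ np.linalg.solve(resid.T @ resid, total))


def ntw_by_regression(scores, moments):
    """N - SSR_0 from regressing 1 on (s_i', g_i'), as in regression (13.42)."""
    Z = np.hstack([scores, moments])
    ones = np.ones(Z.shape[0])
    coef, *_ = np.linalg.lstsq(Z, ones, rcond=None)
    ssr = float(np.sum((ones - Z @ coef) ** 2))
    return Z.shape[0] - ssr
```

The two functions agree only when the scores sum to zero, that is, when `theta_hat` is the (numerically converged) MLE. The outer product form is used here purely to illustrate the algebra; the caution about its finite-sample behavior still applies.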


Example 13.2 contains examples of what are known as conditional moment tests. 
As the name suggests, the idea is to form orthogonality conditions based on some 
key conditional moments, usually the conditional mean or conditional variance, but 
sometimes conditional probabilities or higher order moments. The tests for nonlinear 
regression in Chapter 12 can be viewed as conditional moment tests, and we will 
see several other examples in Part IV. For reasons discussed earlier, we will avoid 
computing the tests using regression (13.42) whenever possible. See Newey (1985), 
Tauchen (1985), and Pagan and Vella (1989) for general treatments and applications 
of conditional moment tests. White’s (1982a) IM test can often be viewed as a con- 
ditional moment test; see Hall (1987) for the linear regression model and White 
(1994) for a general treatment. White (1994, Chap. 10) shows how to allow the
moment function to depend on parameters other than $\boldsymbol{\theta}_o$.


13.8 Partial (or Pooled) Likelihood Methods for Panel Data 


Up to this point we have assumed that the parametric model for the density of y 
given x is correctly specified. This assumption is fairly general because x can contain 
any observable variable. The leading case occurs when x contains variables we view 
as exogenous in a structural model. In other cases, x will contain variables that are 
endogenous in a structural model, but putting them in the conditioning set and find- 
ing the new conditional density makes estimation of the structural parameters easier. 

For studying various panel data models, for estimation using cluster samples, and 
for various other applications, we need to relax the assumption that the full condi- 
tional density of y given x is correctly specified. In some examples, such a model is 
too complicated. Or, for robustness reasons, we do not wish to fully specify the
density of y given x.


13.8.1 Setup for Panel Data


For panel data applications we let $\mathbf{y}$ denote a $T \times 1$ vector, with generic element $y_t$.
Thus, $\mathbf{y}_i$ is a $T \times 1$ random draw vector from the cross section, with $t$th element $y_{it}$.
As always, we are thinking of $T$ small relative to the cross section sample size. With a
slight notational change we can replace $y_t$ with, say, a $G$-vector for each $t$, an
extension that allows us to cover general systems of equations with panel data.

For some vector $\mathbf{x}_t$ containing any set of observable variables, let $\mathrm{D}(y_t \mid \mathbf{x}_t)$ denote
the distribution of $y_t$ given $\mathbf{x}_t$. The key assumption is that we have a correctly
specified model for the density of $y_t$ given $\mathbf{x}_t$; call it $f_t(y_t \mid \mathbf{x}_t; \boldsymbol{\theta})$, $t = 1, 2, \ldots, T$. The vector
$\mathbf{x}_t$ can contain anything, including conditioning variables $\mathbf{z}_t$, lags of these, and lagged
values of $y$. The vector $\boldsymbol{\theta}$ consists of all parameters appearing in $f_t$ for any $t$; some or
all of these may appear in the density for every $t$, and some may appear only in the
density for a single time period.

What distinguishes partial likelihood from maximum likelihood is that we do not
assume that

$$\prod_{t=1}^{T} \mathrm{D}(y_{it} \mid \mathbf{x}_{it}) \tag{13.43}$$

is a conditional distribution of the vector $\mathbf{y}_i$ given some set of conditioning variables.
In other words, even though $f_t(y_t \mid \mathbf{x}_t; \boldsymbol{\theta}_o)$ is the correct density for $y_{it}$ given $\mathbf{x}_{it} = \mathbf{x}_t$
for each $t$, the product of these is not (necessarily) the density of $\mathbf{y}_i$ given some
conditioning variables. Usually, we specify $f_t(y_t \mid \mathbf{x}_t; \boldsymbol{\theta})$ because it is the density of interest
for each $t$.

We define the partial log likelihood for each observation $i$ as

$$\ell_i(\boldsymbol{\theta}) \equiv \sum_{t=1}^{T} \log f_t(y_{it} \mid \mathbf{x}_{it}; \boldsymbol{\theta}), \tag{13.44}$$

which is the sum of the log likelihoods across $t$. What makes partial likelihood
methods work is that $\boldsymbol{\theta}_o$ maximizes the expected value of equation (13.44) provided
we have the densities $f_t(y_t \mid \mathbf{x}_t; \boldsymbol{\theta})$ correctly specified. We also refer to equation
(13.44) as a pooled log likelihood.

By the Kullback-Leibler information inequality, $\boldsymbol{\theta}_o$ maximizes $\mathrm{E}[\log f_t(y_{it} \mid \mathbf{x}_{it}; \boldsymbol{\theta})]$
over $\Theta$ for each $t$, so $\boldsymbol{\theta}_o$ also maximizes the sum of these over $t$. As usual,
identification requires that $\boldsymbol{\theta}_o$ be the unique maximizer of the expected value of equation
(13.44). It is sufficient that $\boldsymbol{\theta}_o$ uniquely maximizes $\mathrm{E}[\log f_t(y_{it} \mid \mathbf{x}_{it}; \boldsymbol{\theta})]$ for each $t$, but
this assumption is not necessary.

The partial (pooled) maximum likelihood estimator (PMLE) $\hat{\boldsymbol{\theta}}$ solves

$$\max_{\boldsymbol{\theta} \in \Theta} \sum_{i=1}^{N} \sum_{t=1}^{T} \log f_t(y_{it} \mid \mathbf{x}_{it}; \boldsymbol{\theta}), \tag{13.45}$$

which is clearly an M-estimator problem (where the asymptotics are with fixed $T$ and
$N \to \infty$). Therefore, from Theorem 12.2, the partial MLE is generally consistent
provided $\boldsymbol{\theta}_o$ is identified.

It is also clear that the partial MLE will be asymptotically normal by Theorem
12.3 in Section 12.3. However, unless

$$p_o(\mathbf{y} \mid \mathbf{z}) = \prod_{t=1}^{T} f_t(y_t \mid \mathbf{x}_t; \boldsymbol{\theta}_o) \tag{13.46}$$

for some subvector $\mathbf{z}$ of $\mathbf{x}$, we cannot apply the CIME. A more general asymptotic
variance estimator of the type covered in Section 12.5.1 is needed, and we provide
such estimators in the next two subsections.

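It is useful to discuss at a general level why equation (13.46) does not necessarily
hold in a panel data setting. First, suppose $\mathbf{x}_{it}$ contains only contemporaneous
conditioning variables, $\mathbf{z}_{it}$; in particular, $\mathbf{x}_{it}$ contains no lagged dependent variables. Then
we can always write

$$p_o(\mathbf{y} \mid \mathbf{z}) = p_1^o(y_1 \mid \mathbf{z}) \cdot p_2^o(y_2 \mid y_1, \mathbf{z}) \cdots p_t^o(y_t \mid y_{t-1}, y_{t-2}, \ldots, y_1, \mathbf{z}) \cdots p_T^o(y_T \mid y_{T-1}, y_{T-2}, \ldots, y_1, \mathbf{z}),$$

where $p_t^o(y_t \mid y_{t-1}, y_{t-2}, \ldots, y_1, \mathbf{z})$ is the true conditional density of $y_t$ given $y_{t-1}$,
$y_{t-2}, \ldots, y_1$ and $\mathbf{z} \equiv (\mathbf{z}_1, \ldots, \mathbf{z}_T)$. (For $t = 1$, $p_1^o$ is the density of $y_1$ given $\mathbf{z}$.) For
equation (13.46) to hold, we should have

$$p_t^o(y_t \mid y_{t-1}, y_{t-2}, \ldots, y_1, \mathbf{z}) = f_t(y_t \mid \mathbf{z}_t; \boldsymbol{\theta}_o), \quad t = 1, \ldots, T,$$

which requires that, once $\mathbf{z}_t$ is conditioned on, neither past lags of $y_t$ nor elements of
$\mathbf{z}$ from any other time period (past or future) appear in the conditional density
$p_t^o(y_t \mid y_{t-1}, y_{t-2}, \ldots, y_1, \mathbf{z})$. Generally, this requirement is very strong, as it requires a
combination of strict exogeneity of $\mathbf{z}_t$ and the absence of dynamics in $p_t^o$.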

Equation (13.46) is more likely to hold when $\mathbf{x}_t$ contains lagged dependent
variables. In fact, if $\mathbf{x}_t$ contains only lagged values of $y_t$, then

$$p_o(\mathbf{y}) = \prod_{t=1}^{T} f_t(y_t \mid \mathbf{x}_t; \boldsymbol{\theta}_o)$$

holds if $f_t(y_t \mid \mathbf{x}_t; \boldsymbol{\theta}_o) = p_t^o(y_t \mid y_{t-1}, y_{t-2}, \ldots, y_1)$ for all $t$ (where $p_1^o$ is the
unconditional density of $y_1$), so that all dynamics are captured by $f_t$. When $\mathbf{x}_t$ contains some
variables $\mathbf{z}_t$ in addition to lagged $y_t$, equation (13.46) requires that the parametric
density captures all of the dynamics (that is, that all lags of $y_t$ and $\mathbf{z}_t$ have been
properly accounted for in $f_t(y_t \mid \mathbf{x}_t; \boldsymbol{\theta}_o)$) and strict exogeneity of $\mathbf{z}_t$.

In most treatments of MLE of dynamic models containing additional exogenous
variables, the strict exogeneity assumption is maintained, often implicitly by taking
$\mathbf{z}_t$ to be nonrandom. In Chapter 7 we saw that strict exogeneity played no role in
getting consistent, asymptotically normal estimators in linear panel data models
using pooled OLS, and the same is true here. We also allow models where the dynamics
have been incompletely specified.

With partial MLE we are interested in fully specifying a density for the conditional
distribution $\mathrm{D}(y_{it} \mid \mathbf{x}_{it})$, and it is useful to have a general yet simple way to define
strict exogeneity of $\{\mathbf{x}_{it} : t = 1, \ldots, T\}$. The definition is simply stated: the
conditioning variables are strictly exogenous if $\mathrm{D}(y_{it} \mid \mathbf{x}_{i1}, \mathbf{x}_{i2}, \ldots, \mathbf{x}_{iT}) = \mathrm{D}(y_{it} \mid \mathbf{x}_{it})$ for
$t = 1, \ldots, T$. Naturally, this condition must fail if $\mathbf{x}_{it}$ contains lags of $y_{it}$ and, just as
in linear models, it can fail if $\mathbf{x}_{it}$ contains elements $\mathbf{z}_{it}$ whose future values react to
unpredictable changes in $y_{it}$.


Example 13.3 (Probit with Panel Data): To illustrate the previous discussion, we
consider estimation of a panel data binary choice model. The idea is that, for each
unit $i$ in the population (individual, firm, and so on) we have a binary outcome, $y_{it}$,
for each of $T$ time periods. For example, if $t$ represents a year, then $y_{it}$ might indicate
whether a person was arrested for a crime during year $t$.

Consider the model in latent variable form:

$$\begin{aligned} y_{it}^* &= \mathbf{x}_{it}\boldsymbol{\theta}_o + e_{it} \\ y_{it} &= 1[y_{it}^* > 0] \\ e_{it} \mid \mathbf{x}_{it} &\sim \mathrm{Normal}(0, 1). \end{aligned} \tag{13.47}$$

The vector $\mathbf{x}_{it}$ might contain exogenous variables $\mathbf{z}_{it}$, lags of these, and even lagged
$y_{it}$ (not lagged $y_{it}^*$). Under the assumptions in model (13.47), we have, for each
$t$, $\mathrm{P}(y_{it} = 1 \mid \mathbf{x}_{it}) = \Phi(\mathbf{x}_{it}\boldsymbol{\theta}_o)$, and the density of $y_{it}$ given $\mathbf{x}_{it} = \mathbf{x}_t$ is $f(y_t \mid \mathbf{x}_t) =
[\Phi(\mathbf{x}_t\boldsymbol{\theta}_o)]^{y_t}[1 - \Phi(\mathbf{x}_t\boldsymbol{\theta}_o)]^{1 - y_t}$.

The partial log likelihood for a cross section observation $i$ is

$$\ell_i(\boldsymbol{\theta}) = \sum_{t=1}^{T} \{y_{it} \log \Phi(\mathbf{x}_{it}\boldsymbol{\theta}) + (1 - y_{it}) \log[1 - \Phi(\mathbf{x}_{it}\boldsymbol{\theta})]\} \tag{13.48}$$

and the partial MLE in this case (which simply maximizes $\ell_i(\boldsymbol{\theta})$ summed across all
$i$) is the pooled probit estimator. With $T$ fixed and $N \to \infty$, this estimator is
consistent and $\sqrt{N}$-asymptotically normal without any assumptions other than
identification and standard regularity conditions.
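The pooled probit objective built from (13.48) is straightforward to code. In the sketch below the array layout (`y` as an $N \times T$ array of zeros and ones, `X` as an $N \times T \times K$ array) and the function names are assumptions made for illustration; a numerical optimizer would then be applied to `pooled_objective`.

```python
import math

import numpy as np

# Elementwise error function; NumPy itself does not provide erf.
_erf = np.vectorize(math.erf)


def norm_cdf(z):
    """Standard normal cdf applied elementwise to an array."""
    return 0.5 * (1.0 + _erf(np.asarray(z, dtype=float) / math.sqrt(2.0)))


def partial_loglik_by_unit(theta, y, X):
    """Partial log likelihood (13.48) for each unit i; returns a length-N array."""
    p = norm_cdf(X @ theta)  # N x T array of Phi(x_it theta)
    return np.sum(y * np.log(p) + (1 - y) * np.log(1.0 - p), axis=1)


def pooled_objective(theta, y, X):
    """Sum over i of the partial log likelihoods: the pooled probit objective."""
    return float(np.sum(partial_loglik_by_unit(theta, y, X)))
```

Keeping the per-unit contributions separate, rather than flattening the panel immediately, is convenient later because inference for the partial MLE works with quantities summed over $t$ within each unit $i$.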

It is very important to know that the pooled probit estimator works without
imposing additional assumptions on $\mathbf{e}_i = (e_{i1}, \ldots, e_{iT})'$. When $\mathbf{x}_{it}$ contains only
exogenous variables $\mathbf{z}_{it}$, it would be standard to assume that

$$e_{it} \text{ is independent of } \mathbf{z}_i \equiv (\mathbf{z}_{i1}, \mathbf{z}_{i2}, \ldots, \mathbf{z}_{iT}), \quad t = 1, 2, \ldots, T. \tag{13.49}$$

This is the natural strict exogeneity assumption (and is much stronger than simply
assuming that $e_{it}$ and $\mathbf{z}_{it}$ are independent for each $t$). The crime example can illustrate
how strict exogeneity might fail. For example, suppose that $z_{it}$ measures the amount
of time the person has spent in prison prior to the current year. An arrest this year
($y_{it} = 1$) certainly has an effect on expected future values of $z_{it}$, so that assumption
(13.49) is almost certainly false. Fortunately, we do not need assumption (13.49) to
apply partial likelihood methods.

A second standard assumption is that the $e_{it}$, $t = 1, 2, \ldots, T$ are serially
independent. This is especially restrictive in a static model. If we maintain this assumption in
addition to assumption (13.49), then equation (13.46) holds (because the $y_{it}$ are then
independent conditional on $\mathbf{z}_i$) and the partial MLE is a conditional MLE.

To relax the assumption that the $y_{it}$ are conditionally independent, we can allow
the $e_{it}$ to be correlated across $t$ (still assuming that no lagged dependent variables
appear). A common assumption is that $\mathbf{e}_i$ has a multivariate normal distribution with
a general correlation matrix. Under this assumption, we can write down the joint
distribution of $\mathbf{y}_i$ given $\mathbf{z}_i$, but it is complicated, and estimation is very
computationally intensive (for discussions see Keane, 1993, and Hajivassiliou and Ruud, 1994).
We will cover a special case, the random effects probit model, in Chapter 15.

A nice feature of the partial MLE is that $\hat{\boldsymbol{\theta}}$ will be consistent and asymptotically
normal even if the $e_{it}$ are arbitrarily serially correlated. This result is entirely
analogous to using pooled OLS in linear panel data models when the errors have arbitrary
serial correlation.

When $\mathbf{x}_{it}$ contains lagged dependent variables, model (13.47) provides a way of
examining dynamic behavior. Or, perhaps $y_{i,t-1}$ is included in $\mathbf{x}_{it}$ as a proxy for
unobserved factors, and our focus is on policy variables in $\mathbf{z}_{it}$. For example, if $y_{it}$ is a
binary indicator of employment, $y_{i,t-1}$ might be included as a control when studying
the effect of a job training program (which may be a binary element of $\mathbf{z}_{it}$) on the
employment probability; this method controls for the fact that participation in job
training this year might depend on employment last year, and it captures the fact that
employment status is persistent. In any case, provided $\mathrm{P}(y_{it} = 1 \mid \mathbf{x}_{it})$ follows a probit,
the pooled probit estimator is consistent and asymptotically normal. The dynamics
may or may not be correctly specified (more on this topic later), and the $\mathbf{z}_{it}$ need not
be strictly exogenous (so that whether someone participates in job training in year $t$
can depend on the past employment history).


13.8.2 Asymptotic Inference 


The most important practical difference between conditional MLE and partial MLE 
is in the computation of asymptotic standard errors and test statistics. In many cases, 
including the pooled probit estimator, the pooled Poisson estimator (see Problem 
13.6), and many other pooled procedures, standard econometrics packages can be 
used to compute the partial MLEs. However, except under certain assumptions, the 
usual standard errors and test statistics reported from a pooled analysis are not valid. 
This situation is entirely analogous to the linear model case in Section 7.8 when the 
errors are serially correlated. 

Estimation of the asymptotic variance of the partial MLE is not difficult. In fact, 
we can combine the M-estimation results from Section 12.5.1 and the results of Sec- 
tion 13.5 to obtain valid estimators. 

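From Theorem 12.3, we have $\mathrm{Avar}\,\sqrt{N}(\hat{\theta} - \theta_o) = A_o^{-1} B_o A_o^{-1}$, where

$$A_o = -E[\nabla_\theta^2 \ell_i(\theta_o)] = -\sum_{t=1}^{T} E[\nabla_\theta^2 \ell_{it}(\theta_o)] = \sum_{t=1}^{T} E[A_{it}(\theta_o)],$$

$$B_o = E[s_i(\theta_o) s_i(\theta_o)'] = E\left\{\left[\sum_{t=1}^{T} s_{it}(\theta_o)\right]\left[\sum_{t=1}^{T} s_{it}(\theta_o)\right]'\right\},$$

$$A_{it}(\theta_o) = -E[\nabla_\theta^2 \ell_{it}(\theta_o) \mid x_{it}], \quad \text{and} \quad s_{it}(\theta) = \nabla_\theta \ell_{it}(\theta)'.$$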


Because we assume that $\theta_o$ is in the interior of the parameter space, and $\theta_o$ maximizes
$E[\ell_{it}(\theta) \mid x_{it}]$ over $\Theta$, it is generally true that $E[s_{it}(\theta_o) \mid x_{it}] = 0$ for all $t$. If $\{x_{it}\}$
is strictly exogenous, then $E[s_{it}(\theta_o) \mid x_i] = 0$ (because $D(y_{it} \mid x_i) = D(y_{it} \mid x_{it})$), but
without strict exogeneity we can only say that $s_{it}(\theta_o)$ has zero mean conditional
on $x_{it}$ (which implies $E[s_{it}(\theta_o)] = 0$, of course). The natural definition of sequential
exogeneity in this context is $D(y_{it} \mid x_{it}, x_{i,t-1}, \ldots, x_{i1}) = D(y_{it} \mid x_{it})$, in which case
$E[s_{it}(\theta_o) \mid x_{it}, x_{i,t-1}, \ldots, x_{i1}] = 0$. As we will see in the next subsection, sequential
exogeneity ensures that the scores are serially uncorrelated if $x_{it}$ contains $y_{i,t-1}$. If $x_{it}$
only includes contemporaneous variables $z_{it}$, lags of such variables, or both, sequential
exogeneity does not imply the scores are serially uncorrelated.

The matrix $A_o$ is just the sum across $t$ of minus the expected Hessian. The matrix
$B_o$ generally depends on the correlation between the scores at different time periods:
$E[s_{it}(\theta_o) s_{ir}(\theta_o)']$, $t \neq r$. For each $t$, the CIME holds:

$$A_{it}(\theta_o) = E[s_{it}(\theta_o) s_{it}(\theta_o)' \mid x_{it}].$$

If $\{x_{it}\}$ is strictly exogenous then $A_{it}(\theta_o) = E[s_{it}(\theta_o) s_{it}(\theta_o)' \mid x_i]$. But strict exogeneity
does not help simplify inference because, whether or not strict exogeneity holds,
$-E[H_i(\theta_o) \mid x_i] = E[s_i(\theta_o) s_i(\theta_o)' \mid x_i]$ likely fails if the scores are serially correlated.
More important, serial correlation in the scores causes $B_o \neq A_o$. Thus, to perform
inference in the context of partial MLE, we generally need separate estimates of
$A_o$ and $B_o$. Given the structure of the partial MLE, these are easy to obtain. Three
possibilities for $A_o$ are


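$$N^{-1}\sum_{i=1}^{N}\sum_{t=1}^{T} -\nabla_\theta^2 \ell_{it}(\hat{\theta}), \qquad N^{-1}\sum_{i=1}^{N}\sum_{t=1}^{T} A_{it}(\hat{\theta}), \qquad \text{and} \qquad N^{-1}\sum_{i=1}^{N}\sum_{t=1}^{T} s_{it}(\hat{\theta}) s_{it}(\hat{\theta})'. \tag{13.50}$$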

The validity of the second of these follows from a standard iterated expectations
argument, and the last of these follows from the CIME for each $t$. In most cases, the
second estimator is preferred when it is easy to compute.

Since $B_o$ depends on $E[s_{it}(\theta_o) s_{it}(\theta_o)']$ as well as on cross product terms, there are
also at least three estimators available for $B_o$. The simplest is


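$$N^{-1}\sum_{i=1}^{N} \hat{s}_i \hat{s}_i' = N^{-1}\sum_{i=1}^{N}\sum_{t=1}^{T} \hat{s}_{it}\hat{s}_{it}' + N^{-1}\sum_{i=1}^{N}\sum_{t=1}^{T}\sum_{r \neq t} \hat{s}_{ir}\hat{s}_{it}', \tag{13.51}$$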

where the second term on the right-hand side accounts for possible serial correlation
in the score. The first term on the right-hand side of equation (13.51) can be replaced
by one of the other two estimators in equation (13.50). The asymptotic variance of
$\hat{\theta}$ is estimated, as usual, by $\hat{A}^{-1}\hat{B}\hat{A}^{-1}/N$ for the chosen estimators $\hat{A}$ and $\hat{B}$. The
asymptotic standard errors come directly from this matrix, and Wald tests for linear
and nonlinear hypotheses can be obtained directly. The robust score statistic discussed
in Section 12.6.2 can also be used. When $B_o \neq A_o$, the likelihood ratio statistic
computed after pooled estimation is not valid.

Because the CIME holds for each $t$, $B_o = A_o$ when the scores evaluated at $\theta_o$ are
serially uncorrelated, that is, when

$$E[s_{it}(\theta_o) s_{ir}(\theta_o)'] = 0, \qquad t \neq r. \tag{13.52}$$


When the score is serially uncorrelated, inference is very easy: the usual MLE statistics
computed from the pooled estimation, including likelihood ratio statistics, are
asymptotically valid. Effectively, we can ignore the fact that a time dimension is
present. The estimator of $\mathrm{Avar}(\hat{\theta})$ is just $\hat{A}^{-1}/N$, where $\hat{A}$ is one of the matrices in
equation (13.50).


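Example 13.3 (continued): For the pooled probit example, a simple, general estimator
of the asymptotic variance is

$$\left[\sum_{i=1}^{N}\sum_{t=1}^{T} A_{it}(\hat{\theta})\right]^{-1}\left[\sum_{i=1}^{N} s_i(\hat{\theta}) s_i(\hat{\theta})'\right]\left[\sum_{i=1}^{N}\sum_{t=1}^{T} A_{it}(\hat{\theta})\right]^{-1}, \tag{13.53}$$

where

$$A_{it}(\hat{\theta}) = \frac{\{\phi(x_{it}\hat{\theta})\}^2\, x_{it}' x_{it}}{\Phi(x_{it}\hat{\theta})[1 - \Phi(x_{it}\hat{\theta})]}$$

and

$$s_i(\theta) = \sum_{t=1}^{T} s_{it}(\theta) = \sum_{t=1}^{T} \frac{\phi(x_{it}\theta)\, x_{it}'\,[y_{it} - \Phi(x_{it}\theta)]}{\Phi(x_{it}\theta)[1 - \Phi(x_{it}\theta)]}.$$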


The estimator (13.53) contains cross product terms of the form $s_{it}(\hat{\theta}) s_{ir}(\hat{\theta})'$, $t \neq r$,
and so it is fully robust. If the score is serially uncorrelated, then the usual probit
standard errors and test statistics from the pooled estimation are valid. We will
discuss a sufficient condition for the scores to be serially uncorrelated in the next
subsection.
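The estimator in (13.53) can be sketched numerically. The following is a minimal illustration, not a production routine: it assumes a scalar regressor with no intercept, a hypothetical simulated panel in which a time-constant effect induces serial correlation in the scores, and Fisher scoring for the pooled probit. All function and variable names are invented for the example.

```python
import math
import random
from statistics import NormalDist

STD_NORMAL = NormalDist()

def score_and_info(theta, x, y):
    """Probit score s_it and expected-Hessian term A_it for scalar x_it."""
    index = x * theta
    cdf = min(max(STD_NORMAL.cdf(index), 1e-12), 1.0 - 1e-12)  # keep away from 0 and 1
    pdf = STD_NORMAL.pdf(index)
    score = pdf * x * (y - cdf) / (cdf * (1.0 - cdf))
    info = pdf ** 2 * x ** 2 / (cdf * (1.0 - cdf))
    return score, info

def pooled_probit(panel, max_iter=100, tol=1e-10):
    """Maximize the pooled (partial) log likelihood by Fisher scoring."""
    theta = 0.0
    for _ in range(max_iter):
        total_score = 0.0
        total_info = 0.0
        for unit in panel:
            for x, y in unit:
                s, a = score_and_info(theta, x, y)
                total_score += s
                total_info += a
        step = total_score / total_info
        theta += step
        if abs(step) < tol:
            break
    return theta

def standard_errors(theta, panel):
    """Usual pooled standard error and the fully robust one from (13.53)."""
    sum_a = 0.0  # sum over i and t of A_it
    sum_b = 0.0  # sum over i of s_i^2, where s_i = sum_t s_it
    for unit in panel:
        unit_score = 0.0
        for x, y in unit:
            s, a = score_and_info(theta, x, y)
            unit_score += s
            sum_a += a
        sum_b += unit_score ** 2
    usual_se = math.sqrt(1.0 / sum_a)
    robust_se = math.sqrt(sum_b) / sum_a
    return usual_se, robust_se

# Hypothetical data: y_it = 1[x_i + c_i + u_it > 0] with a time-constant c_i,
# so P(y_it = 1 | x_i) is a probit with coefficient 1/sqrt(2).
rng = random.Random(0)
N, T = 500, 5
panel = []
for _ in range(N):
    x_i = rng.gauss(0.0, 1.0)
    c_i = rng.gauss(0.0, 1.0)
    panel.append([(x_i, 1 if x_i + c_i + rng.gauss(0.0, 1.0) > 0 else 0)
                  for _ in range(T)])

theta_hat = pooled_probit(panel)
usual_se, robust_se = standard_errors(theta_hat, panel)
print(theta_hat, usual_se, robust_se)
```

In this design the cross product terms are positive, so the robust standard error should exceed the usual pooled one; with serially uncorrelated scores the two would be close.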


13.8.3 Inference with Dynamically Complete Models 


There is a very important case where condition (13.52) holds, in which case all
statistics obtained by treating $\ell_i(\theta)$ as a standard log likelihood are valid. For any
definition of $x_t$, we say that $\{f_t(y_t \mid x_t; \theta_o): t = 1, \ldots, T\}$ is a dynamically complete
conditional density if

$$f_t(y_{it} \mid x_{it}; \theta_o) = p_t^o(y_{it} \mid x_{it}, y_{i,t-1}, x_{i,t-1}, y_{i,t-2}, \ldots, y_{i1}, x_{i1}), \qquad t = 1, \ldots, T. \tag{13.54}$$




In other words, $f_t(y_{it} \mid x_{it}; \theta_o)$ must be the conditional density of $y_{it}$ given $x_{it}$ and the
entire past of $(x_{it}, y_{it})$. Equation (13.54) implies that the density is correctly specified.
We can state the assumption that $x_{it}$ captures all of the distributional dynamics for
$y_{it}$ by writing $D(y_{it} \mid x_{it}, y_{i,t-1}, x_{i,t-1}, \ldots, y_{i1}, x_{i1}) = D(y_{it} \mid x_{it})$, a shorthand that is
useful because it is separate from model specification.

When $x_{it} = z_{it}$ for contemporaneous exogenous variables, assumption (13.54) is
very strong: it means that, once $z_{it}$ is controlled for, no past values of $z_{it}$ or $y_{it}$ appear
in the conditional density $p_t^o(y_{it} \mid z_{it}, y_{i,t-1}, z_{i,t-1}, y_{i,t-2}, \ldots, y_{i1}, z_{i1})$. When $x_{it}$ contains $z_{it}$
and some lags (similar to a finite distributed lag model), then equation (13.54) is
perhaps more reasonable, but it still assumes that lagged $y_{it}$ has no effect on $y_{it}$ once
current and lagged $z_{it}$ are controlled for. That assumption (13.54) can be false is
analogous to the omnipresence of serial correlation in static and finite distributed lag
regression models. One important feature of dynamic completeness is that it does not
require strict exogeneity of $z_{it}$ [since only current and lagged $x_{it}$ appear in equation
(13.54)].

Dynamic completeness is more likely to hold when $x_{it}$ contains lagged dependent
variables. The issue, then, is whether enough lags of $y_{it}$ (and $z_{it}$) have been included in
$x_{it}$ to fully capture the dynamics. For example, if $x_{it} = (z_{it}, y_{i,t-1})$, then equation
(13.54) means that, along with $z_{it}$, only one lag of $y_{it}$ is needed to capture all of the
dynamics.

Showing that condition (13.52) holds under dynamic completeness is easy. First,
for each $t$, $E[s_{it}(\theta_o) \mid x_{it}] = 0$, since $f_t(y_t \mid x_t; \theta_o)$ is a correctly specified conditional
density. But then, under assumption (13.54),

$$E[s_{it}(\theta_o) \mid x_{it}, y_{i,t-1}, x_{i,t-1}, \ldots, y_{i1}, x_{i1}] = 0. \tag{13.55}$$

Now consider the expected value in condition (13.52) for $r < t$. Since $s_{ir}(\theta_o)$ is a
function of $(x_{ir}, y_{ir})$, which is in the conditioning set (13.55), the usual iterated
expectations argument shows that condition (13.52) holds. It follows that, under dynamic
completeness, the usual maximum likelihood statistics from the pooled estimation
are asymptotically valid. This result is completely analogous to pooled OLS
under dynamic completeness of the conditional mean and homoskedasticity (see
Section 7.8).
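Written out, the iterated expectations step runs as follows. The symbol $\mathcal{I}_{it}$ is shorthand introduced only for this display; it denotes the conditioning set in (13.55), which contains $(x_{ir}, y_{ir})$ for every $r < t$:

$$E[s_{it}(\theta_o) s_{ir}(\theta_o)'] = E\{E[s_{it}(\theta_o) s_{ir}(\theta_o)' \mid \mathcal{I}_{it}]\} = E\{E[s_{it}(\theta_o) \mid \mathcal{I}_{it}]\, s_{ir}(\theta_o)'\} = 0, \qquad r < t.$$

The second equality uses the fact that $s_{ir}(\theta_o)$ is a function of $\mathcal{I}_{it}$, and the last uses (13.55). The case $r > t$ follows by symmetry.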

If the panel data probit model is dynamically complete, any software package
that does standard probit can be used to obtain valid standard errors and test statistics,
provided the response probability satisfies $P(y_{it} = 1 \mid x_{it}) = P(y_{it} = 1 \mid x_{it}, y_{i,t-1}, x_{i,t-1}, \ldots)$.
Without dynamic completeness the standard errors and test statistics
generally need to be adjusted for serial dependence.

Since dynamic completeness affords nontrivial simplifications, does this fact mean
that we should always include lagged values of exogenous and dependent variables 
until equation (13.54) appears to be satisfied? Not necessarily. Static models are 
sometimes desirable even if they neglect dynamics. For example, suppose that we 
have panel data on individuals in an occupation where pay is determined partly by 
cumulative productivity. (Professional athletes and college professors are two ex- 
amples.) An equation relating salary to the productivity measures, and possibly de- 
mographic variables, is appropriate. Nothing implies that the equation would be 
dynamically complete; in fact, past salary could help predict current salary, even after 
controlling for observed productivity. But it does not make much sense to include 
past salary in the regression equation. As we know from Chapter 10, a reasonable 
approach is to include an unobserved effect in the equation, and this does not lead to 
a model with complete dynamics. See also Section 13.9. 

We may wish to test the null hypothesis that the density is dynamically complete. 
White (1994) shows how to test whether the score is serially correlated in a pure time 
series setting. A similar approach can be used with panel data. A general test for 
dynamic misspecification can be based on the limiting distribution of (the vectoriza- 
tion of) 


$$N^{-1/2}\sum_{i=1}^{N}\sum_{t=2}^{T} \hat{s}_{it}\hat{s}_{i,t-1}',$$

where the scores are evaluated at the pooled MLE. Rather than derive a general 
statistic here, we will study tests of dynamic completeness in particular applications 
later (see particularly Chapters 15, 16, and 17). 


13.9 Panel Data Models with Unobserved Effects 


As we saw in Chapters 10 and 11, linear unobserved effects panel data models play 
an important role in modern empirical research. Nonlinear unobserved effects panel 
data models are becoming increasingly more important. Although we will cover 
particular models in Chapters 15 through 18, it is useful to have a general treatment. 


13.9.1 Models with Strictly Exogenous Explanatory Variables 


For each $i$, let $\{(y_{it}, x_{it}): t = 1, 2, \ldots, T\}$ be a random draw from the cross section,
where $y_{it}$ and $x_{it}$ can both be vectors. Associated with each cross section unit $i$ is
unobserved heterogeneity, $c_i$, which could be a vector. We assume interest lies in the
distribution of $y_{it}$ given $(x_{it}, c_i)$. The vector $x_{it}$ can contain lags of contemporaneous
variables, say $z_{it}$ (for example, $x_{it} = (z_{it}, z_{i,t-1}, z_{i,t-2})$), or even leads of $z_{it}$ (for example,
$x_{it} = (z_{it}, z_{i,t+1})$), but not lags of $y_{it}$. Whatever the lag structure, we let $t = 1$
denote the first time period available for estimation.

Let $f_t(y_t \mid x_t, c; \theta)$ denote a correctly specified density for each $t$. A key assumption
on $x_{it}$ is strict exogeneity conditional on the unobserved effects:

$$D(y_{it} \mid x_{i1}, x_{i2}, \ldots, x_{iT}, c_i) = D(y_{it} \mid x_{it}, c_i), \qquad t = 1, \ldots, T, \tag{13.56}$$

which means that $x_{ir}$, $r \neq t$, does not appear in the conditional distribution of $y_{it}$ once
$x_{it}$ and $c_i$ have been accounted for. In addition to ruling out lagged dependent variables,
(13.56) does not allow for general feedback from unanticipated changes in $y_{it}$
to changes in $x_{i,t+h}$ for $h \geq 1$.

A common but restrictive approach to estimating $\theta_o$ (and other quantities of interest,
such as average partial effects) is to assume that $c_i$ is independent of $x_i =
(x_{i1}, x_{i2}, \ldots, x_{iT})$, that is, $D(c_i \mid x_i) = D(c_i)$, and to model the distribution of $c_i$. Such
an approach is very similar to the random effects approach to linear panel data
models in Chapter 10. There we did not assume full independence (because linear
models do not require such a strong assumption), but we assumed conditional mean
independence, $E(c_i \mid x_i) = E(c_i)$, or, at the very least, zero correlation. In general
nonlinear models, it is handy to label $D(c_i \mid x_i) = D(c_i)$ as a random effects (RE)
assumption.

In special cases, which we will address in Chapters 15 and 18, we can consistently
estimate $\theta_o$ without imposing any assumptions about $D(c_i \mid x_i)$. Such situations are
reminiscent of the fixed effects (FE) assumptions from Chapter 10, and we will use
that label for nonlinear models, too. As we will see through later examples, in nonlinear
models estimation of $\theta_o$ is not always sufficient to calculate the quantities of
interest, but it is nevertheless useful when possible.

One way to avoid specifying a model for $D(c_i \mid x_i)$ would be to treat
$\{c_i: i = 1, \ldots, N\}$ as parameters to estimate along with $\theta_o$. However, because the
number of $c_i$ increases with $N$, attempting to estimate them leads to an incidental
parameters problem for estimating $\theta_o$. Namely, we cannot consistently estimate $\theta_o$
with fixed $T$ and $N \to \infty$ (except by fluke in a couple of special cases, including the
linear model with additive $c_i$ in Chapter 10). For this reason, in this book we reserve
the designation "fixed effects" for situations in which we can eliminate the $c_i$ from a
particular conditional distribution that still depends on $\theta_o$, and then apply conditional
maximum likelihood methods to consistently estimate $\theta_o$ for fixed $T$. We do
not give a general treatment because the method applies only in special cases, and we
cover those in Part IV.

A middle ground between RE and FE approaches is to allow $D(c_i \mid x_i)$ to depend
on $x_i$, but then to model this distribution by specifying a parametric model of the
conditional density. In Chapter 10, we mentioned that modeling $E(c_i \mid x_i)$ as a function
of $x_i$ has been called a correlated random effects (CRE) approach, and we
will use this label subsequently to refer to situations where we model the distribution
$D(c_i \mid x_i)$. Often, to impose parsimony we restrict the way in which $D(c_i \mid x_i)$
can depend on the time series $\{x_{it}: t = 1, 2, \ldots, T\}$. A common restriction is
$D(c_i \mid x_i) = D(c_i \mid \bar{x}_i)$, where $\bar{x}_i$ is the time average. With large enough $T$, we can
allow $D(c_i \mid x_i)$ to depend on other features, such as the individual-specific variance
$(T-1)^{-1}\sum_{t=1}^{T}(x_{it} - \bar{x}_i)'(x_{it} - \bar{x}_i)$ or average growth rates. For the present treatment,
there is no gain in considering special cases. Therefore, let $h(c \mid x; \delta)$ be a correctly
specified density for $D(c_i \mid x_i)$.

Once we have specified $h(c \mid x; \delta)$, there are two common ways to proceed. First, we
can make the additional assumption that, conditional on $(x_i, c_i)$, the $y_{it}$ are independent.
Then, the joint density of $(y_{i1}, \ldots, y_{iT})$, given $(x_i, c_i)$, is

$$\prod_{t=1}^{T} f_t(y_t \mid x_{it}, c_i; \theta_o).$$


We cannot use this density directly to estimate $\theta_o$ because we do not observe the
outcomes $c_i$. Treating the $c_i$ as parameters to estimate with $\theta_o$ leads to the incidental
parameters problem mentioned earlier. Instead, we can use the density of $c_i$ given $x_i$
to integrate out the dependence on $c$. The density of $y_i$ given $x_i$ is

$$\int_{\mathbb{R}^J}\left[\prod_{t=1}^{T} f_t(y_t \mid x_{it}, c; \theta_o)\right] h(c \mid x_i; \delta_o)\, dc, \tag{13.57}$$

where $J$ is the dimension of $c$ and $h(c \mid x; \delta)$ is the correctly specified model for the
density of $c_i$ given $x_i = x$. For concreteness, we assume that $c$ is a continuous random
vector. For each $i$, the log-likelihood function is


$$\log\left\{\int_{\mathbb{R}^J}\left[\prod_{t=1}^{T} f_t(y_{it} \mid x_{it}, c; \theta)\right] h(c \mid x_i; \delta)\, dc\right\}. \tag{13.58}$$


[It is important to see that expression (13.58) does not depend on the $c_i$; $c$ has been
integrated out.] Assuming identification and standard regularity conditions, we can
consistently estimate $\theta_o$ and $\delta_o$ by conditional MLE, where the asymptotics are for
fixed $T$ and $N \to \infty$. The CMLE is $\sqrt{N}$-asymptotically normal.
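To make the integration in (13.57) and (13.58) concrete, here is a minimal sketch under assumptions that go beyond the text: a scalar regressor, a probit response $P(y_{it} = 1 \mid x_{it}, c_i) = \Phi(x_{it}\theta + c_i)$, and $h(c \mid x; \delta)$ taken to be Normal$(0, \sigma^2)$ regardless of $x$ (an RE-type assumption). The integral is approximated with a simple trapezoid rule; Gauss-Hermite quadrature would be a more common choice in practice. All names are hypothetical.

```python
import math
from itertools import product
from statistics import NormalDist

STD_NORMAL = NormalDist()

def conditional_density(y_seq, x_seq, c, theta):
    """prod_t f_t(y_t | x_t, c; theta) for the assumed probit response."""
    value = 1.0
    for y, x in zip(y_seq, x_seq):
        p = STD_NORMAL.cdf(x * theta + c)
        value *= p if y == 1 else 1.0 - p
    return value

def integrated_density(y_seq, x_seq, theta, sigma, grid_points=2001, width=8.0):
    """Trapezoid approximation to the integral over c in (13.57)."""
    heterogeneity = NormalDist(0.0, sigma)  # assumed h(c | x; delta)
    lower = -width * sigma
    step = 2.0 * width * sigma / (grid_points - 1)
    total = 0.0
    for k in range(grid_points):
        c = lower + k * step
        weight = 0.5 if k in (0, grid_points - 1) else 1.0
        total += weight * conditional_density(y_seq, x_seq, c, theta) * heterogeneity.pdf(c)
    return total * step

def log_likelihood_i(y_seq, x_seq, theta, sigma):
    """Log likelihood for unit i as in (13.58); c no longer appears."""
    return math.log(integrated_density(y_seq, x_seq, theta, sigma))

x_i = [0.5, -1.0, 0.2]
theta, sigma = 1.0, 1.0

# The integrated density is a proper density: it sums to one over all outcomes.
total_probability = sum(
    integrated_density(list(y_seq), x_i, theta, sigma)
    for y_seq in product([0, 1], repeat=len(x_i))
)

# With T = 1 the integral has a closed form for this normal mixture.
single_period = integrated_density([1], [0.5], theta, sigma)
closed_form = STD_NORMAL.cdf(0.5 * theta / math.sqrt(1.0 + sigma ** 2))
print(total_probability, single_period, closed_form)
```

Summing `log_likelihood_i` over units and maximizing over `theta` and `sigma` would give the conditional MLE in this special case.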

A different approach is often simpler and places no restrictions on the joint distribution
of the $y_{it}$ conditional on $(x_i, c_i)$. For each $t$, we can obtain the density of $y_{it}$
given $x_i$ by integrating out $c_i$:

$$\int_{\mathbb{R}^J} [f_t(y_t \mid x_{it}, c; \theta_o)]\, h(c \mid x_i; \delta_o)\, dc.$$


Now the problem becomes one of partial MLE. We estimate $\theta_o$ and $\delta_o$ by maximizing

$$\sum_{i=1}^{N}\sum_{t=1}^{T} \log\left\{\int_{\mathbb{R}^J} [f_t(y_{it} \mid x_{it}, c; \theta)]\, h(c \mid x_i; \delta)\, dc\right\}, \tag{13.59}$$

where the term in braces is a correctly specified model of the density of $y_{it}$ given $x_i$.
(Actually, using PMLE, $\theta_o$ and $\delta_o$ are not always separately identified, although interesting
functions of them, such as average partial effects, are. We will see examples
in Chapters 15, 16, and 17.) Across time, the scores for each $i$ will necessarily be serially
correlated because the $y_{it}$ are dependent when we condition only on $x_i$, and not
also on $c_i$. Therefore, we must make inference robust to serial dependence, as in
Section 13.8.2. In Chapter 15, we will study both the conditional MLE and partial
MLE approaches for unobserved effects probit models. We do the same for Tobit
models in Chapter 17.
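A simple special case shows why the parameters need not be separately identified under partial MLE. As an illustration only, assume a scalar $c_i$, the probit response $P(y_{it} = 1 \mid x_{it}, c_i) = \Phi(x_{it}\theta_o + c_i)$, and $c_i \mid x_i \sim \mathrm{Normal}(0, \sigma_o^2)$; none of these is imposed in the general treatment. Integrating out $c$ for a single period gives

$$P(y_{it} = 1 \mid x_i) = \int_{\mathbb{R}} \Phi(x_{it}\theta_o + c)\, \frac{1}{\sigma_o}\phi\!\left(\frac{c}{\sigma_o}\right) dc = \Phi\!\left(\frac{x_{it}\theta_o}{\sqrt{1 + \sigma_o^2}}\right).$$

The period-by-period density therefore depends on $(\theta_o, \sigma_o)$ only through the scaled vector $\theta_o/\sqrt{1 + \sigma_o^2}$, which is what pooled probit estimates in this case. That scaled vector is also what is needed for average partial effects here.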


13.9.2 Models with Lagged Dependent Variables 


Now assume that we are interested in modeling $D(y_{it} \mid z_{it}, y_{i,t-1}, c_i)$ where, for simplicity,
we include only contemporaneous conditioning variables, $z_{it}$, and only one lag
of $y_{it}$. Adding lags (or even leads) of $z_{it}$ or more lags of $y_{it}$ requires only a notational
change.

A key assumption is that we have the dynamics correctly specified and that $z_i =
\{z_{i1}, \ldots, z_{iT}\}$ is appropriately strictly exogenous (conditional on $c_i$). These assumptions
are both captured by

$$D(y_{it} \mid z_{it}, y_{i,t-1}, c_i) = D(y_{it} \mid z_i, y_{i,t-1}, \ldots, y_{i0}, c_i). \tag{13.60}$$


We assume that $f_t(y_t \mid z_t, y_{t-1}, c; \theta)$ is a correctly specified density for the conditional
distribution on the left-hand side of equation (13.60). Given strict exogeneity of
$\{z_{it}: t = 1, \ldots, T\}$ and dynamic completeness, the density of $(y_{i1}, \ldots, y_{iT})$ given
$(z_i = z, y_{i0} = y_0, c_i = c)$ is

$$\prod_{t=1}^{T} f_t(y_t \mid z_t, y_{t-1}, c; \theta_o). \tag{13.61}$$

(By convention, $y_{i0}$ is the first observation on $y_{it}$.) Again, to estimate $\theta_o$, we integrate
$c$ out of this density. To do so, we specify a density for $c_i$ given $z_i$ and the initial value
$y_{i0}$ (sometimes called the initial condition). Let $h(c \mid z, y_0; \delta)$ denote the model for this
conditional density. Then, assuming that we have this model correctly specified, the
density of $(y_{i1}, \ldots, y_{iT})$ given $(z_i = z, y_{i0} = y_0)$ is

$$\int_{\mathbb{R}^J}\left[\prod_{t=1}^{T} f_t(y_t \mid z_t, y_{t-1}, c; \theta_o)\right] h(c \mid z, y_0; \delta_o)\, dc, \tag{13.62}$$

which, for each $i$, leads to the log-likelihood function conditional on $(z_i, y_{i0})$:

$$\log\left\{\int_{\mathbb{R}^J}\left[\prod_{t=1}^{T} f_t(y_{it} \mid z_{it}, y_{i,t-1}, c; \theta)\right] h(c \mid z_i, y_{i0}; \delta)\, dc\right\}. \tag{13.63}$$


We sum expression (13.63) across $i = 1, \ldots, N$ and maximize with respect to $\theta$ and $\delta$
to obtain the CMLEs. Provided all functions are sufficiently differentiable and identification
holds, the CMLEs are consistent and $\sqrt{N}$-asymptotically normal, as usual.
Because we have fully specified the conditional density of $(y_{i1}, \ldots, y_{iT})$ given $(z_i, y_{i0})$,
the general theory of CMLE applies directly. (The fact that the distribution of $y_{i0}$
given $z_i$ would typically depend on $\theta_o$ has no bearing on the consistency of the
CMLE. The fact that we are conditioning on $y_{i0}$, rather than basing the analysis on
$D(y_{i0}, y_{i1}, \ldots, y_{iT} \mid z_i)$, means that we are generally sacrificing efficiency. But by conditioning
on $y_{i0}$ we do not have to find $D(y_{i0} \mid z_i)$, a feat that can be very difficult, if
not impossible.) The asymptotic variance of $(\hat{\theta}', \hat{\delta}')'$ can be estimated by any of the
formulas in equation (13.32) (properly modified to account for estimation of $\theta_o$ and
$\delta_o$).

A weakness of the CMLE approach is that we must specify a density for $c_i$ given
$(z_i, y_{i0})$, but this is a price we pay for estimating dynamic, nonlinear models with
unobserved effects. The alternative of treating the $c_i$ as parameters to estimate
(which is, unfortunately, often labeled the "fixed effects" approach) does not lead to
consistent estimation of $\theta_o$ because of the incidental parameters problem.

In any application, several issues need to be addressed. First, when are the parameters
identified? Second, what quantities are we interested in? As we cannot observe
$c_i$, we typically want to average out $c_i$ when obtaining partial effects. Wooldridge
(2005b) shows that average partial effects are generally identified under the assumptions
that we have made. Finally, obtaining the CMLE can be very difficult computationally,
as can be obtaining the asymptotic variance estimates in equation (13.32).
If $c_i$ is a scalar, estimation is easier, but there is still a one-dimensional integral to
approximate for each $i$. In Chapters 15 through 18 we will see that, under reasonable
assumptions, standard software can be used to estimate dynamic models with unobserved
effects, including effects that are averaged across the distribution of heterogeneity.
See also Problem 13.11 for application to a dynamic linear model.


13.10 Two-Step Estimators Involving Maximum Likelihood 


From the results in Chapter 12 on two-step estimation, we know that consistency and 
asymptotic normality generally apply to two-step estimators that involve maximum 
likelihood. We provide a brief treatment here, covering two cases. The first is when 
the second-step estimation is MLE (and the first step may or may not be). Then we 
cover an interesting situation that can arise when the first-step estimator is MLE. 


13.10.1 Second-Step Estimator Is Maximum Likelihood Estimator 


Assume that we have a correctly specified density for the conditional distribution
$D(y_i \mid x_i)$. Write the model as $f(y \mid x; \theta, \gamma)$ for $\theta$ a $P \times 1$ vector and $\gamma$ a $J \times 1$ vector.
The true density is $f(y \mid x; \theta_o, \gamma_o)$. A preliminary estimator of $\gamma_o$, say $\hat{\gamma}$, is plugged into
the log-likelihood function, and $\hat{\theta}$ solves

$$\max_{\theta \in \Theta} \sum_{i=1}^{N} \log f(y_i \mid x_i; \theta, \hat{\gamma}).$$

We call $\hat{\theta}$ a two-step maximum likelihood estimator. Consistency of $\hat{\theta}$ follows from
results for two-step M-estimators. The practical limitation is that $\log f(y_i \mid x_i; \theta, \gamma)$ is
continuous on $\Theta \times \Gamma$ and that $\theta_o$ and $\gamma_o$ are identified.

Asymptotic normality of the two-step MLE follows directly from the results on
two-step M-estimation in Chapter 12. As we saw there, in general the asymptotic
variance of $\sqrt{N}(\hat{\theta} - \theta_o)$ depends on the asymptotic variance of $\sqrt{N}(\hat{\gamma} - \gamma_o)$ [see equation
(12.41)], so we need to know the estimation problem solved by $\hat{\gamma}$. In some cases
estimation of $\gamma_o$ can be ignored. An important case is where the expected Hessian,
defined with respect to $\theta$ and $\gamma$, is block diagonal (the matrix $F_o$ in equation (12.36) is
zero in this case). It can also hold for some values of $\theta_o$, which is important for testing
certain hypotheses. We will encounter several examples in Part IV.

Of course, we can also apply two-step methods when the second step is a partial 
MLE, resulting in a two-step partial MLE. As usual, consistency and asymptotic 
normality hold under correct specification of the marginal (conditional) density 
for each $t$ and standard regularity conditions. The practical issue is in computing
an appropriate asymptotic variance. We will not give a general treatment here, as 
the general results on M-estimation can be applied directly when we need them in 
Part IV.


13.10.2 Surprising Efficiency Result When the First-Step Estimator Is Conditional Maximum Likelihood Estimator


We now turn to a case where the conclusion seems counterintuitive: under certain
assumptions, estimating parameters in a first stage, by CMLE, can improve the
asymptotic efficiency of a second-step M-estimator. To study this phenomenon, assume
that the second-step M-estimator, $\hat{\theta}$, solves the problem

$$\min_{\theta \in \Theta} \sum_{i=1}^{N} q(v_i, w_i, z_i, \theta, \hat{\gamma}), \tag{13.64}$$


where we maintain the regularity conditions used in Chapter 12 to apply mean value
expansions, the uniform law of large numbers, and the central limit theorem. For
reasons we will see, we have separated the data vector in (13.64) into three vectors, $v_i$,
$w_i$, and $z_i$. We assume the first-step estimator, $\hat{\gamma}$, comes from a (conditional) MLE
problem (that satisfies the appropriate regularity conditions):

$$\max_{\gamma \in \Gamma} \sum_{i=1}^{N} \log h(v_i \mid z_i; \gamma),$$

where $h(\cdot \mid z; \gamma)$ is a model of the density underlying $D(v_i \mid z_i)$, and we assume this
density is correctly specified with population value $\gamma_o$. By the information matrix
equality, we can assume

$$\sqrt{N}(\hat{\gamma} - \gamma_o) = \{E[d_i(\gamma_o) d_i(\gamma_o)']\}^{-1} N^{-1/2}\sum_{i=1}^{N} d_i(\gamma_o) + o_p(1), \tag{13.65}$$

where $d_i(\gamma) = \nabla_\gamma \log h(v_i \mid z_i; \gamma)'$ is the $J \times 1$ score of the first-step log likelihood. If we
assume nothing further, we would simply have to derive, and estimate, the asymptotic
variance of $\sqrt{N}(\hat{\theta} - \theta_o)$ using the methods in Section 12.5.2. But suppose we
add the assumption

$$D(v_i \mid w_i, z_i) = D(v_i \mid z_i), \tag{13.66}$$


which is often called a conditional independence assumption: conditional on $z_i$, $v_i$ and
$w_i$ are independent. In a sense, assumption (13.66) means that $z_i$ is such a good
predictor of $v_i$ (at least relative to $w_i$) that once we know $z_i$, $w_i$ tells us nothing
about the likelihood of outcomes on $v_i$. This assumption is special, but, as we will see
in Part IV, it holds under certain stratified sampling schemes as well as in the context
of estimating treatment effects under so-called ignorability of treatment assumptions.

Now, let $s_i(\theta, \gamma) = s(v_i, w_i, z_i, \theta, \gamma) = \nabla_\theta q(v_i, w_i, z_i, \theta, \gamma)'$ be the $P \times 1$ score of
the second-step objective function, but only with respect to $\theta$. In order to find
$\mathrm{Avar}\,\sqrt{N}(\hat{\theta} - \theta_o)$, we need to find the $P \times J$ matrix $F_o = E[\nabla_\gamma s_i(\theta_o, \gamma_o)]$. The key
step is to use the generalized CIME in equation (13.39), where it is important to note
that, under (13.66), $d_i(\gamma)$ is the score for the density of $v_i$ given $(w_i, z_i)$, even though it
depends only on $z_i$. Therefore, because $s_i(\theta_o, \gamma)$ is a function of $(v_i, w_i, z_i)$, we have

$$-E[\nabla_\gamma s_i(\theta_o, \gamma_o) \mid w_i, z_i] = E[s_i(\theta_o, \gamma_o)\, d_i(\gamma_o)' \mid w_i, z_i], \tag{13.67}$$


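so, using iterated expectations, we conclude that $F_o = -E[s_i(\theta_o, \gamma_o)\, d_i(\gamma_o)']$. But then
by equations (12.39) and (13.65),

$$\sqrt{N}(\hat{\theta} - \theta_o) = -A_o^{-1} N^{-1/2}\sum_{i=1}^{N}\left\{s_i^o - E(s_i^o d_i^{o\prime})[E(d_i^o d_i^{o\prime})]^{-1} d_i^o\right\} + o_p(1)$$

$$= -A_o^{-1} N^{-1/2}\sum_{i=1}^{N} g_i^o + o_p(1),$$

where $A_o = E[\nabla_\theta s_i(\theta_o, \gamma_o)]$ is the $P \times P$ Hessian of the objective function with respect
to $\theta$, $g_i^o = s_i^o - E(s_i^o d_i^{o\prime})[E(d_i^o d_i^{o\prime})]^{-1} d_i^o$ are the population residuals from the population
system regression of $s_i^o$ on $d_i^{o\prime}$, and the "o" superscript denotes evaluation at $\theta_o$
and $\gamma_o$ or just $\gamma_o$. Therefore,

$$\mathrm{Avar}\,\sqrt{N}(\hat{\theta} - \theta_o) = A_o^{-1} D_o A_o^{-1}, \tag{13.68}$$

where

$$D_o = E(g_i^o g_i^{o\prime}) = \mathrm{Var}(g_i^o). \tag{13.69}$$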


If we knew $\gamma_o$ rather than estimating it by CMLE, the asymptotic variance of the
estimator, say $\tilde{\theta}$, would be $\mathrm{Avar}\,\sqrt{N}(\tilde{\theta} - \theta_o) = A_o^{-1} B_o A_o^{-1}$, where $B_o = E(s_i^o s_i^{o\prime})$ is
the usual expected outer product of the score without accounting for the first-step
estimation (because there is none). But $B_o - D_o$ is positive semi-definite (p.s.d.), and
so the two-step M-estimator is generally more (asymptotically) efficient than the one-step
M-estimator that uses knowledge of $\gamma_o$. In one case, the two estimators are
asymptotically equivalent, namely, when $E(s_i^o d_i^{o\prime}) = 0$ (which implies $F_o = 0$).
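The p.s.d. claim follows from the usual decomposition for a population regression. Write $\Pi_o = E(s_i^o d_i^{o\prime})[E(d_i^o d_i^{o\prime})]^{-1}$, a symbol introduced only for this display, so that $s_i^o = \Pi_o d_i^o + g_i^o$ with $E(d_i^o g_i^{o\prime}) = 0$ by construction. Then

$$B_o = E(s_i^o s_i^{o\prime}) = \Pi_o E(d_i^o d_i^{o\prime}) \Pi_o' + E(g_i^o g_i^{o\prime}), \qquad \text{so} \qquad B_o - D_o = E(s_i^o d_i^{o\prime})[E(d_i^o d_i^{o\prime})]^{-1} E(d_i^o s_i^{o\prime}),$$

which is p.s.d. because $[E(d_i^o d_i^{o\prime})]^{-1}$ is positive definite. The difference is zero exactly when $E(s_i^o d_i^{o\prime}) = 0$.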

An immediate implication of the improvement in efficiency from estimating $\gamma_o$ is that
if we do use $\hat{\gamma}$ but then ignore its estimation when carrying out inference in the second
stage, our inference will be conservative. In particular, the standard errors computed
from $\hat{A}^{-1}\hat{B}\hat{A}^{-1}/N$ are larger, or no smaller, than they could be. (As usual, the "^" means
that we replace all parameters with their estimates.) Consequently, using a standard econometrics
package that computes sandwich standard errors but does not easily account for first-step
estimation produces conservative inference under assumption (13.66).

It is also easy to compute standard errors that reflect the increased precision from
using the MLE, $\hat{\gamma}$, in place of $\gamma_o$. Namely, let $\hat{s}_i = s_i(\hat{\theta}, \hat{\gamma})$ and $\hat{d}_i = d_i(\hat{\gamma})$ be the scores
from the second- and first-step estimations, respectively. (Remember, $s_i(\theta, \gamma)$ is the
score with respect to $\theta$ only.) Then, let $\hat{g}_i'$ be the $1 \times P$ residuals from the multivariate
regression of $\hat{s}_i'$ on $\hat{d}_i'$, $i = 1, \ldots, N$. Then, we obtain

$$\hat{D} = N^{-1}\sum_{i=1}^{N} \hat{g}_i \hat{g}_i' \tag{13.70}$$

and form the sandwich $\hat{A}^{-1}\hat{D}\hat{A}^{-1}/N$ as $\widehat{\mathrm{Avar}}(\hat{\theta})$.
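The residual-regression recipe behind (13.70) can be sketched in a scalar setting. The example below is a hypothetical illustration, not a general routine: $v_i$ is a selection indicator whose probability depends only on $z_i$ (so (13.66) holds by construction), the first step is a one-parameter logit CMLE, and the second step is an inverse-probability-weighted mean of $w_i$. The data-generating process and all names are assumptions made for the example.

```python
import math
import random

def logistic(a):
    return 1.0 / (1.0 + math.exp(-a))

rng = random.Random(1)
N = 2000
z = [rng.gauss(0.0, 1.0) for _ in range(N)]
w = [2.0 + z_i + rng.gauss(0.0, 1.0) for z_i in z]           # outcome with mean 2
v = [1 if rng.random() < logistic(z_i) else 0 for z_i in z]  # selection depends on z only

# First step: CMLE of gamma in P(v = 1 | z) = logistic(gamma * z), via Newton steps.
gamma_hat = 0.0
for _ in range(50):
    p = [logistic(gamma_hat * z_i) for z_i in z]
    grad = sum(z_i * (v_i - p_i) for z_i, v_i, p_i in zip(z, v, p))
    info = sum(z_i ** 2 * p_i * (1.0 - p_i) for z_i, p_i in zip(z, p))
    gamma_hat += grad / info
p_hat = [logistic(gamma_hat * z_i) for z_i in z]

# Second step: theta_hat minimizes sum_i (v_i / p_hat_i) * (w_i - theta)^2.
weights = [v_i / p_i for v_i, p_i in zip(v, p_hat)]
theta_hat = sum(k * w_i for k, w_i in zip(weights, w)) / sum(weights)

# Scores: s_hat_i is the derivative of q with respect to theta only,
# d_hat_i is the first-step logit score.
s_hat = [-2.0 * k * (w_i - theta_hat) for k, w_i in zip(weights, w)]
d_hat = [z_i * (v_i - p_i) for z_i, v_i, p_i in zip(z, v, p_hat)]

# Residuals from regressing s_hat on d_hat (no intercept, matching g = s - E(sd')E(dd')^{-1}d).
slope = sum(s * d for s, d in zip(s_hat, d_hat)) / sum(d * d for d in d_hat)
g_hat = [s - slope * d for s, d in zip(s_hat, d_hat)]

A_hat = 2.0 * sum(weights) / N               # average second derivative of q in theta
B_hat = sum(s * s for s in s_hat) / N        # ignores first-step estimation
D_hat = sum(g * g for g in g_hat) / N        # equation (13.70), scalar case

se_naive = math.sqrt(B_hat / A_hat ** 2 / N)
se_adjusted = math.sqrt(D_hat / A_hat ** 2 / N)
print(theta_hat, se_naive, se_adjusted)
```

Because residuals from a least squares fit cannot have a larger sum of squares than the dependent variable, `D_hat` never exceeds `B_hat`, which mirrors the population ranking of $D_o$ and $B_o$.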


13.11 Quasi-Maximum Likelihood Estimation 


Until now, we have assumed that a (conditional) density function has been correctly 
specified, although in the panel data case we covered situations where we might only 
specify a density separately for each time period (see Sections 13.8 and 13.9). As we 
have seen, if we have a correctly specified density for the conditional distributions 
$D(y_i \mid x_i)$ or $D(y_{it} \mid x_{it})$, $t = 1, \ldots, T$, then MLE or partial MLE has desirable
large-sample properties for estimating the population parameters.

There are situations in which it is important to understand the properties of MLE 
methods when densities are misspecified, or partly misspecified. Some authors, nota- 
bly White (1982a, 1994), take the view that all models should be viewed, at least 
initially, as being misspecified, and inference and interpretation should proceed 
accordingly. White (1994) also considers an intermediate view in which part of 
the distribution is correctly specified—the leading case being a conditional mean 
function—while other aspects might be misspecified. As shown by Gourieroux, 
Monfort, and Trognon (1984a), this posture is especially useful for the class of den- 
sities in the linear exponential family, where often we either suspect or know for 
certain that the full distribution is not correctly specified, but we pay careful attention 
to specification of the conditional mean. In the next few subsections, we study the 
properties of estimators that maximize a log likelihood under various degrees of 
misspecification. 


13.11.1 General Misspecification 


In general analyses of maximum likelihood estimation of misspecified models, one
typically posits the same setup as in Section 13.3, less the assumption that the conditional
density is correctly specified, stated in equation (13.11). Thus, for MLE with
a generally misspecified density, there is no "true" value of theta, which we called $\theta_o$.
Instead, it is standard to postulate the existence of a unique solution to the population
problem (13.14). Following White (1994), we denote this value by $\theta^*$. White
(1982a, 1994) also discusses the interpretation of $\theta^*$ as providing the best approximation
to the true density in the parametric class $f(y \mid x; \theta)$, where closeness is measured
in terms of the Kullback-Leibler information criterion. This interpretation can be
inferred from our discussion in Section 13.3.

When the density is misspecified, we typically call the solution to (13.15), which
we still denote by $\hat{\theta}$, a quasi-maximum likelihood estimator (QMLE). Some authors
prefer the name pseudo-maximum likelihood estimator. We also refer to the log-likelihood
function as the quasi-log-likelihood (pseudo-log-likelihood) function.

Consistency of $\hat{\theta}$ for $\theta^*$ follows in the same way as when the model is correctly
specified: by assumption, $\theta^*$ maximizes $E[\log f(y_i \mid x_i; \theta)]$ over the parameter space
$\Theta$, and then we simply assume enough regularity conditions so that the quasi-log
likelihood converges to its expectation uniformly over $\Theta$, just as in Theorem
13.1.

Asymptotic inference concerning $\theta^*$ is more interesting. First, one might legitimately
ask: If the model is misspecified, what does it mean to test hypotheses about
$\theta^*$? After all, $\theta^*$ does not generally index conditional probabilities or conditional
moments. Nevertheless, if we take the realistic stance that models of conditional 
densities are probably misspecified, the best we can do is to test hypotheses about our 
best approximation to the true density. And sometimes we assume that the main 
model we are interested in is correctly specified, but we estimate an auxiliary model 
as a way to obtain, say, instrumental variables. In such cases, we often do not want 
to assume that the auxiliary model—which is often chosen for computational 
convenience—is correctly specified in any sense. 

It is fairly straightforward to conduct inference on $\theta^*$. Without further
assumptions, there is only one legitimate estimator of $\mathrm{Avar}(\hat\theta)$:

$$\widehat{\mathrm{Avar}}(\hat\theta) = \left(\sum_{i=1}^N H_i(\hat\theta)\right)^{-1}\left(\sum_{i=1}^N s_i(\hat\theta)s_i(\hat\theta)'\right)\left(\sum_{i=1}^N H_i(\hat\theta)\right)^{-1} \tag{13.71}$$

where, as before, $s_i(\theta)$ is the $P \times 1$ score vector and $H_i(\theta)$ is the
$P \times P$ Hessian. (As usual, this estimator is “legitimate” in the sense that, when
multiplied by $N$, the right-hand side of (13.71) converges in probability to
$\mathrm{Avar}[\sqrt{N}(\hat\theta - \theta^*)] = A^{*-1}B^*A^{*-1}$, where
$A^* = -E[H_i(\theta^*)]$ and $B^* = E[s_i(\theta^*)s_i(\theta^*)']$; see Section 12.3.)
In some cases (we cover an important one in the next subsection) we can use the expected
Hessian conditional on $x_i$, but, in general, $E[H_i(\theta^*) \mid x_i]$ cannot be
computed if $f(y \mid x; \theta)$ is misspecified.

The estimate in (13.71) requires computing both first and second derivatives,
but these calculations can be done numerically, if necessary. Once we have
$\widehat{\mathrm{Avar}}(\hat\theta)$, forming Wald tests for restrictions on $\theta^*$ is
straightforward. In particular, individual asymptotic $t$ statistics are easily obtained
because the standard errors of the $\hat\theta_j$ are the square roots of the diagonal
elements of (13.71).
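
As a rough illustration of computing (13.71) with numerical derivatives, the following Python sketch covers only the scalar case $P = 1$, so the matrix inverses reduce to divisions. The function name `numeric_sandwich`, the step size `h`, and the toy Gaussian quasi-log likelihood are assumptions made for this example and do not come from the text.

```python
import math

def numeric_sandwich(loglik_i, theta_hat, data, h=1e-4):
    """Scalar-parameter version of (13.71) using central differences.

    loglik_i(theta, obs) is a hypothetical user-supplied function that returns
    the quasi-log likelihood of a single observation.
    """
    hess_sum = 0.0   # sum over i of H_i(theta_hat)
    outer_sum = 0.0  # sum over i of s_i(theta_hat)^2
    for obs in data:
        up = loglik_i(theta_hat + h, obs)
        mid = loglik_i(theta_hat, obs)
        down = loglik_i(theta_hat - h, obs)
        score = (up - down) / (2 * h)          # numerical first derivative
        hess = (up - 2 * mid + down) / (h * h)  # numerical second derivative
        hess_sum += hess
        outer_sum += score * score
    avar = outer_sum / (hess_sum * hess_sum)    # H^{-1} (sum s^2) H^{-1}
    return avar, math.sqrt(avar)

# Toy quasi-log likelihood (an assumption for this sketch): Gaussian with unit
# variance, whose QMLE is the sample mean regardless of the true distribution.
def gaussian_qll(theta, y):
    return -0.5 * (y - theta) ** 2

y = [1.0, 2.0, 3.0, 6.0]
theta_hat = sum(y) / len(y)
avar, se = numeric_sandwich(gaussian_qll, theta_hat, y)
```

The returned standard error is what would enter an asymptotic $t$ statistic; with a vector parameter the same steps apply with matrix inverses in place of the divisions.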

Score testing must be carried out using the fully robust framework in Section 
12.6.2 because the information matrix equality cannot be assumed to hold. Plus, the 
conditional expected value of the Hessian cannot be computed in general, so the 
score test that uses the Hessian (evaluated at the restricted estimates) is the only sta- 
tistic that can be relied on. Fortunately, the statistic in equation (12.68) is always 
nonnegative even though A might not be positive definite (because B is always at 
least p.s.d.). Inference based on the LR statistic is very difficult and not advised: the 
LR statistic no longer has a limiting chi-square distribution (and its limiting distri- 
bution depends on unknown parameters). 

As an example, consider the probit model we carried along in earlier sections, but
where $P(y_i = 1 \mid x_i) \neq \Phi(x_i\theta)$ for all $K \times 1$ vectors $\theta$. In
other words, the probit model is misspecified. If we obtain $\hat\theta$ by maximizing the
probit log likelihood, under very weak conditions $\hat\theta$ converges in probability to
some $\theta^* \in \mathbb{R}^K$, and $\Phi(x\theta^*)$ provides the “best” approximation to
$P(y_i = 1 \mid x_i = x)$ in the sense of minimizing the Kullback-Leibler distance. Based on
the probit model, for a continuous conditioning variable, say $x_j$, we would estimate the
partial effect of $x_j$ on $P(y_i = 1 \mid x_i = x)$ as the partial derivative
$\hat\theta_j\phi(x\hat\theta)$. Therefore, having an estimate of the asymptotic variance
of $\hat\theta$ is critical for obtaining, say, an asymptotic confidence interval for this
approximate partial effect. Some econometrics packages include simple options for
computing the sandwich estimator in (13.71), in which case inference under
misspecification is straightforward.

We can also allow for complete density misspecification in the context of
partial (pooled) MLE. We must allow for a general estimate of the Hessian:
for each $i$, $H_i(\theta) = \sum_{t=1}^T H_{it}(\theta)$. Further, without assuming that
$f_t(y_t \mid x_t; \theta)$ is correctly specified for each $t$, we can no longer conclude
that $D(y_{it} \mid x_{it}) = D(y_{it} \mid x_{it}, y_{i,t-1}, x_{i,t-1}, \ldots, y_{i1}, x_{i1})$
is sufficient for the scores of the log likelihood to be serially uncorrelated (when
evaluated now at $\theta^*$). Therefore, without further analysis, one must use (13.71) as
the estimated asymptotic variance, where $s_i(\theta) = \sum_{t=1}^T s_{it}(\theta)$, as in
Section 13.8.2. (For commonly used nonlinear models, such as probit models, some
econometrics software packages include options to compute the fully robust matrix
(13.70) for panel data applications.)


13.11.2 Model Selection Tests


The properties of maximum likelihood under general misspecification can be used to 
derive a model selection test due to Vuong (1989). As the name implies, the test is 
intended to allow one to choose between competing models. Here we treat the case 
where the two models are, in a sense to be made precise, nonnested. (When one 
model is a special case of the other, the score approach provides a much simpler way 
to test an attractive null model against a more general alternative.) 

If we are content with just choosing the model with the “best fit” given the data at
hand, then it is legitimate to choose the model with the largest value of the log
likelihood. After all, whether or not the model is correctly specified, the average log
likelihood consistently estimates the expected log likelihood, which differs from the
negative of the Kullback-Leibler information criterion (KLIC) for that model only by a
term that is the same for every model. And, we know from Section 13.3, the true density
minimizes the KLIC (equivalently, it maximizes the expected log likelihood). Therefore, a
density model cannot be correctly specified if it delivers (asymptotically) a lower
average log likelihood than another model. Comparing log-likelihood values is analogous
to comparing R-squareds in a regression context.

As suggested by Vuong (1989), it is useful to attach statistical significance to
the difference in log likelihoods. For nonnested models, this turns out to be
remarkably easy. Let $f_1(y \mid x; \theta_1)$ and $f_2(y \mid x; \theta_2)$ be competing
models for the density of $D(y_i \mid x_i)$, where both may be misspecified. Let
$\hat\theta_1$ and $\hat\theta_2$ be the QMLEs converging to $\theta_1^*$ and $\theta_2^*$,
respectively. Let $\mathcal{L}_m = \sum_{i=1}^N \ell_{im}(\hat\theta_m)$ be the quasi-log
likelihood evaluated at the relevant estimate for $m = 1, 2$. Then

$$(\mathcal{L}_1 - \mathcal{L}_2)/N \overset{p}{\to} E[\log f_1(y_i \mid x_i; \theta_1^*)] - E[\log f_2(y_i \mid x_i; \theta_2^*)],$$

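where the expected values are over the joint distribution $(x_i, y_i)$. We can actually say
more. Using a mean value expansion and the $\sqrt{N}$-consistency of $\hat\theta_m$ for
$\theta_m^*$, it can be shown that

$$N^{-1/2}(\mathcal{L}_1 - \mathcal{L}_2) = N^{-1/2}\sum_{i=1}^N [\ell_{i1}(\hat\theta_1) - \ell_{i2}(\hat\theta_2)] = N^{-1/2}\sum_{i=1}^N [\ell_{i1}(\theta_1^*) - \ell_{i2}(\theta_2^*)] + o_p(1). \tag{13.72}$$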


Equation (13.72) is the key to obtaining a model specification test because it
shows that the estimators $\hat\theta_1$ and $\hat\theta_2$ do not affect the asymptotic
distribution of $N^{-1/2}(\mathcal{L}_1 - \mathcal{L}_2)$. Therefore, we can obtain an
asymptotic normal distribution for $N^{-1/2}(\mathcal{L}_1 - \mathcal{L}_2)$ under the null
hypothesis

$$H_0: E[\ell_{i1}(\theta_1^*)] = E[\ell_{i2}(\theta_2^*)]. \tag{13.73}$$

In particular, under (13.73),

$$N^{-1/2}\sum_{i=1}^N [\ell_{i1}(\theta_1^*) - \ell_{i2}(\theta_2^*)] \overset{d}{\to} \mathrm{Normal}(0, \eta^2), \tag{13.74}$$


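where $\eta^2 \equiv \mathrm{Var}[\ell_{i1}(\theta_1^*) - \ell_{i2}(\theta_2^*)]$. A consistent
estimator of $\eta^2$ is

$$\hat\eta^2 = N^{-1}\sum_{i=1}^N [\ell_{i1}(\hat\theta_1) - \ell_{i2}(\hat\theta_2)]^2. \tag{13.75}$$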


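Vuong’s model selection statistic is

$$\mathrm{VMS} = N^{-1/2}(\mathcal{L}_1 - \mathcal{L}_2)/\hat\eta = \frac{N^{-1/2}\sum_{i=1}^N [\ell_{i1}(\hat\theta_1) - \ell_{i2}(\hat\theta_2)]}{\left\{N^{-1}\sum_{i=1}^N [\ell_{i1}(\hat\theta_1) - \ell_{i2}(\hat\theta_2)]^2\right\}^{1/2}} \overset{d}{\to} \mathrm{Normal}(0, 1), \tag{13.76}$$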


where the standard normal distribution holds under $H_0$. Note that, while the
numerator is simply the difference in log likelihoods from the two estimated models, the
denominator requires computing the squared difference in the log likelihoods for
each $i$. A simple way to obtain a valid test is to define
$\hat d_i = \ell_{i1}(\hat\theta_1) - \ell_{i2}(\hat\theta_2)$ for each $i$ and then simply to
regress $\hat d_i$ on unity to test whether its mean is different from zero.
(This version of the statistic subtracts off the sample average of
$\{\hat d_i : i = 1, 2, \ldots, N\}$ in forming the variance estimate in the denominator.)
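
The two versions of the statistic just described can be sketched in a few lines of Python. This is only an illustration: the function name `vuong_statistic` and the toy per-observation log likelihoods are hypothetical, and the inputs are assumed to be the values $\ell_{i1}(\hat\theta_1)$ and $\ell_{i2}(\hat\theta_2)$ already computed from two fitted models.

```python
import math

def vuong_statistic(loglik1, loglik2):
    """Return (VMS, z) from per-observation quasi-log likelihoods.

    VMS follows (13.75)-(13.76), with an uncentered variance estimate.
    z is the t statistic from regressing d_i on unity, which demeans d_i
    in the variance estimate.
    """
    n = len(loglik1)
    d = [a - b for a, b in zip(loglik1, loglik2)]
    total = sum(d)                          # L_1 - L_2
    eta2 = sum(di * di for di in d) / n     # (13.75)
    vms = total / math.sqrt(n * eta2)       # N^{-1/2}(L_1 - L_2) / eta_hat
    dbar = total / n
    s2 = sum((di - dbar) ** 2 for di in d) / (n - 1)
    z = dbar / math.sqrt(s2 / n)            # regression-on-unity version
    return vms, z

# Hypothetical per-observation log likelihoods from two fitted models.
ll_model1 = [-1.0, -2.0, -0.5, -1.5]
ll_model2 = [-2.0, -1.0, -2.5, -1.5]
vms, z = vuong_statistic(ll_model1, ll_model2)
```

Either value would be compared with standard normal critical values under the null; the two differ only in whether the sample mean of the differences is removed from the variance estimate.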

In applying Vuong’s approach, it is important to understand the scope of its
application, including the underlying null hypothesis. We should not use the VMS
statistic and its limiting standard normal distribution for testing nested models under
correct specification. Recall that the LR statistic is simply
$LR = 2(\mathcal{L}_{ur} - \mathcal{L}_r)$, and, under the null, $LR$ has a limiting
$\chi^2_Q$ distribution, where $Q$ is the number of restrictions. The important point is
that, if the models are nested and correctly specified, then
$\ell_{i1}(\theta_1^*) - \ell_{i2}(\theta_2^*) = \ell_i(\theta_o) - \ell_i(\theta_o) = 0$.
That is, the difference in log likelihoods evaluated at the plims of the estimators is
identically zero under the null. This degeneracy makes the asymptotic equivalence in
equation (13.72) useless for deriving a test statistic because the variance $\eta^2$ would
be identically zero. For nested models, we do not divide the difference in log
likelihoods by $\sqrt{N}$ because the result would be a statistic that converges in
probability to zero.


The sense in which the models must be nonnested to apply Vuong’s approach is
that

$$P[\ell_{i1}(\theta_1^*) \neq \ell_{i2}(\theta_2^*)] > 0. \tag{13.77}$$

In other words, the log likelihoods evaluated at the pseudo-true values $\theta_1^*$ and
$\theta_2^*$ must differ for a nontrivial set of outcomes on $(x_i, y_i)$. This not only
rules out models that are obviously nested, but it rules out other degeneracies, too. For
example, if $y_i$ is a count variable, $x_i = (1, x_{i2}, \ldots, x_{iK})$, and we specify
different Poisson distributions, the first with mean function $\exp(x_i\theta)$ and the
second with mean function $(x_i\theta)^2$, these models are nonnested provided that the
mean of $y_i$ given $x_i$ actually depends on the nonconstant elements in $x_i$. But if
$E(y_i \mid x_i) = E(y_i)$, then $f_1(y \mid x; \theta_1^*)$ and $f_2(y \mid x; \theta_2^*)$
are Poisson distributions with the same (constant) means, and the limiting standard
normal distribution for Vuong’s statistic fails. On the other hand, if the competing
models are Poisson and geometric, even with the same mean function, say
$\exp(x_i\theta)$, the models are nonnested no matter what, because the Poisson and
geometric distributions differ even if they both have constant means.

Because the models must be nonnested, the null hypothesis in equation (13.73) can 
only hold if both models are misspecified. If one model were correctly specified, yet 
the densities differed, then we would have a strict inequality in (13.73) in favor of 
the correctly specified model. We can summarize when it is appropriate to apply 
Vuong’s test: it applies to nonnested models where the null hypothesis is that both 
models are misspecified yet fit equally well (in the sense that they have the same 
expected log likelihoods). 

If we reject model 2 in favor of model 1 because VMS is statistically greater
than zero, then we can only conclude that model 1 fits better in the sense that
$E[\ell_{i1}(\theta_1^*)] > E[\ell_{i2}(\theta_2^*)]$. It does not mean that model 1 is
correctly specified (although it could be). There are many models that can fit better
than a given model, and clearly not all can be correct.

Naturally, Vuong’s approach applies directly to panel data methods when two
complete densities have been specified for $D(y_{i1}, \ldots, y_{iT} \mid x_{i1}, \ldots, x_{iT})$.
But it can also be extended to partial (pooled) MLEs, provided we properly account for the
time series dependence. For each $t$, let $f_{t1}(y_t \mid x_t; \theta_1)$ and
$f_{t2}(y_t \mid x_t; \theta_2)$ be competing models of the conditional density in each time
period. As in Sections 13.8 and 13.9, the log likelihoods are

$$\ell_{im}(\theta_m) = \sum_{t=1}^T \log f_{tm}(y_{it} \mid x_{it}; \theta_m) \equiv \sum_{t=1}^T \ell_{itm}(\theta_m), \quad m = 1, 2. \tag{13.78}$$

The same null hypothesis, (13.73), makes sense in the PMLE setting (and is the
weakest sense in which the models fit equally well). Moreover, the convergence result
in equation (13.74) still holds under the null. Assuming the nonnested condition
(13.77) is satisfied, the variance $\eta^2$ is positive. However, in estimating $\eta^2$, we
must account for the serial dependence in
$\{\ell_{it1}(\theta_1^*) - \ell_{it2}(\theta_2^*) : t = 1, \ldots, T\}$. Let
$\hat d_{it} = \ell_{it1}(\hat\theta_1) - \ell_{it2}(\hat\theta_2)$ denote the difference in
estimated log likelihoods for each $t$, and let $\bar d_t = N^{-1}\sum_{i=1}^N \hat d_{it}$.
Then $\hat\eta^2$ is easily obtained as

$$\hat\eta^2 = N^{-1}\sum_{i=1}^N\left\{\sum_{t=1}^T (\hat d_{it} - \bar d_t)^2 + \sum_{t=1}^T\sum_{r \neq t} (\hat d_{it} - \bar d_t)(\hat d_{ir} - \bar d_r)\right\}. \tag{13.79}$$

This variance estimator allows for the possibility that the mean difference in log
likelihoods varies across $t$ under the null, but that the averages across $t$ are the same.
If the null hypothesis is the stronger version,
$E[\ell_{it1}(\theta_1^*)] = E[\ell_{it2}(\theta_2^*)]$ for $t = 1, \ldots, T$, then
$\bar d_t$ can be replaced with the average of $\hat d_{it}$ across $i$ and $t$, say
$\bar d$. In this case, the test statistic is simply the $t$ statistic
$\bar d/\mathrm{se}(\bar d)$, where $\mathrm{se}(\bar d)$ is the heteroskedasticity and
serial correlation robust standard error from the pooled regression $\hat d_{it}$ on 1,
$t = 1, \ldots, T$; $i = 1, \ldots, N$.
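
A short Python sketch of the variance estimate in (13.79) follows. It is written for illustration only: the function name `panel_vuong` and the toy array of differences are hypothetical, and the input `d[i][t]` is assumed to hold the estimated difference in log likelihoods for unit $i$ in period $t$.

```python
import math

def panel_vuong(d):
    """Return (eta2_hat, VMS) for an N x T list of log-likelihood differences.

    eta2_hat sums squared deviations from the period means plus all
    cross-period products within each unit, then averages over units.
    """
    n, T = len(d), len(d[0])
    dbar = [sum(row[t] for row in d) / n for t in range(T)]   # period means
    eta2 = 0.0
    for row in d:
        e = [row[t] - dbar[t] for t in range(T)]
        own = sum(v * v for v in e)
        cross = sum(e[t] * e[r] for t in range(T) for r in range(T) if r != t)
        eta2 += own + cross
    eta2 /= n
    total = sum(sum(row) for row in d)       # L_1 - L_2 summed over i and t
    vms = total / math.sqrt(n * eta2)
    return eta2, vms

# Hypothetical differences for N = 4 units and T = 2 periods.
d = [[1.0, 2.0], [3.0, 2.0], [2.0, 5.0], [2.0, 3.0]]
eta2, vms = panel_vuong(d)
```

The inner sum is algebraically the square of each unit's summed deviation, which is why the estimate stays nonnegative even when the cross-period products are negative.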

Vuong’s model selection test should not be confused with specification tests in the 
context of nonnested models. For example, the Cox (1961, 1962) approach tests a 
specified model against a nonnested alternative, and a key component of the test is 
the average difference in log likelihoods, $(\mathcal{L}_1 - \mathcal{L}_2)/N$. But with Cox’s approach,
one model is taken to be the correct model under the null hypothesis. (And, in practice,
each of two models is taken to be the null model and then tested against the
other.) If one model is correct and the models are truly nonnested, then the expected 
values of the log likelihoods, evaluated at the plims of the MLE (for the null model) 
and quasi-MLE (for the alternative model), must differ; computation of the Cox 
statistic requires estimating the mean of the difference in log likelihoods, conditional 
on $x_i$, under the null hypothesis. In some cases, including when both distributions
are normal but with different means or variances (or both), and binary response 
models, the Cox statistic is easy to compute. But generally finding the conditional 
mean difference in the log likelihoods in closed form is intractable. (The mean can be 
simulated, but how much trouble does one want to go through for a specification 
test?) 

The Cox test can be cast as a conditional moment test when we extend the
framework in Section 13.7 to allow for the moment conditions to depend on parameters in
addition to $\theta$, as done in White (1994, Chap. 9). For further discussion and several
other approaches to testing a model against a nonnested alternative, see Gourieroux
and Monfort (1994) and White (1994, Chap. 10).


13.11.3 Quasi-Maximum Likelihood Estimation in the Linear Exponential Family 


Section 13.11.1, and the model selection test in the previous subsection, allowed that
nothing about the model $f(y \mid x; \theta)$, or $f_t(y_t \mid x_t; \theta)$, $t = 1, \ldots, T$, is correctly specified.
Sometimes, a useful middle ground between correct specification and complete mis- 
specification is to allow just one or two features of a conditional density to be cor- 
rectly specified. In this section, we study a class of QMLEs that consistently estimate 
the parameters in a correctly specified conditional mean. 

As one learns in introductory econometrics—although it may not be stated in quite 
this way—the ordinary least squares (OLS) estimator is a QMLE: OLS can be 
obtained by maximizing the Gaussian (normal) log-likelihood function with a
conditional mean linear in parameters; in fact, we can arbitrarily set the variance 
equal to any fixed value without affecting the estimates of the mean parameters. Not 
surprisingly, the nonlinear least squares estimator we covered in Chapter 12 has the 
same property: minimizing the sum of squared residuals is the same as maximizing 
the Gaussian quasi-log likelihood. Therefore, NLS is a QMLE based on the normal 
density function. 

The normal log likelihood is not the only log likelihood that identifies the param- 
eters of a conditional mean despite arbitrary misspecification of the remaining fea- 
tures of the distribution. The Bernoulli log likelihood, the Poisson log likelihood, the 
exponential log likelihood, and others share this feature. These log likelihoods are
all members of the linear exponential family (LEF), and it is useful to provide a 
somewhat general treatment of this class of QMLEs. For simplicity, we consider the 
case of a scalar response, even though there are results for multiple responses, too. 
We draw on Gourieroux, Monfort, and Trognon (1984a), or GMT. 

A log likelihood in the LEF can be written as a function of the mean as

$$\log f(y \mid \mu) = a(\mu) + b(y) + yc(\mu), \tag{13.80}$$

for functions $a(\cdot)$, $b(\cdot)$, and $c(\cdot)$, where $\mu$ is a candidate value of the
mean of $y_i$ and $\mu_o = E(y_i)$ is the true population mean. Let $M$ denote the set of
possible values of the mean. GMT (1984a) show that $\mu_o$ solves

$$\max_{\mu \in M}[a(\mu) + E(y_i)c(\mu)] = \max_{\mu \in M}[a(\mu) + \mu_o c(\mu)]. \tag{13.81}$$

The functions in (13.80) are easily obtained for the normal, Bernoulli, Poisson,
exponential, and other cases; GMT (1984a) contains a summary table. For the Bernoulli
distribution,

$$\log f(y \mid \mu) = (1 - y)\log(1 - \mu) + y\log(\mu) = \log(1 - \mu) + y\log[\mu/(1 - \mu)], \quad 0 < \mu < 1.$$

Therefore, $a(\mu) = \log(1 - \mu)$, $b(y) = 0$, and $c(\mu) = \log[\mu/(1 - \mu)]$. For some
distributions to fit into the LEF, notably the gamma and negative binomial, a nuisance
parameter must be fixed at a specific value; see GMT (1984a) for details.

In the most popular examples, it is easy to directly verify that $\mu_o$ solves (13.81). We
will consider special cases in Part IV. For now, it is important to understand the
meaning of the result. Consider the Bernoulli case but, rather than assuming $y_i$ is a
zero-one variable, let $y_i$ be any random variable with support in the unit interval,
$[0, 1]$. $y_i$ can be discrete, continuous, or have both features. For example, we could
have $P(y_i = 0) > 0$ but $P(y_i = y) = 0$ for $y \in (0, 1]$, or $y_i$ might take on values
in $\{0, 1/m_i, 2/m_i, \ldots, 1\}$ for some positive integer $m_i$. Regardless of the nature
of $y_i$, provided its mean $\mu_o$ is in $(0, 1)$, $\mu_o$ maximizes the expected value of
the Bernoulli log likelihood.
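
The result can be checked numerically with a small Python sketch. Everything here is an illustrative assumption: the fractional sample `y`, the grid of candidate means, and the function name `bernoulli_qll` are made up, and the sample average stands in for the population expectation.

```python
import math

def bernoulli_qll(y, mu):
    """Sample average of the Bernoulli quasi-log likelihood at candidate mean mu."""
    return sum((1 - yi) * math.log(1 - mu) + yi * math.log(mu) for yi in y) / len(y)

# Hypothetical fractional outcomes in [0, 1], with mass at zero; not binary data.
y = [0.0, 0.0, 0.1, 0.25, 0.4, 0.5, 0.75, 1.0]

# Search candidate means on a fine grid inside (0, 1).
grid = [k / 1000 for k in range(1, 1000)]
mu_star = max(grid, key=lambda mu: bernoulli_qll(y, mu))
ybar = sum(y) / len(y)
```

On this grid the maximizer coincides with the sample mean of `y`, even though `y` is not a zero-one variable, which mirrors the population statement about $\mu_o$.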

In practice, we are interested in conditional rather than unconditional means,
which we parameterize as $m(x, \theta)$. Then the conditional quasi-log likelihood function
becomes

$$\log f(y \mid m(x, \theta)) = a(m(x, \theta)) + b(y) + yc(m(x, \theta)). \tag{13.82}$$

Because the mean is now assumed to be correctly specified, we assume there is
$\theta_o \in \Theta$ such that $E(y_i \mid x_i) = m(x_i, \theta_o)$. A simple iterated
expectations argument shows that $\theta_o$ solves

$$\max_{\theta \in \Theta} E[a(m(x_i, \theta)) + y_i c(m(x_i, \theta))], \tag{13.83}$$

regardless of the actual distribution $D(y_i \mid x_i)$. Again, it is important to understand
the meaning of this result. The nature of $y_i$ need not even correspond to the chosen
density. For example, $y_i$ could be a nonnegative, continuous variable, and we use
the Poisson quasi-log likelihood. The Poisson QMLE is consistent for the conditional
mean parameters provided the mean (with the leading case being an exponential
function) is correctly specified. The only restriction is that candidates for
$E(y_i \mid x_i = x)$ should have the same range as allowed in the chosen LEF density.

One of the most useful characterizations of QMLE in the LEF is based on the
score. It can be shown that the score has the form

$$s_i(\theta) = \nabla_\theta m(x_i, \theta)'[y_i - m(x_i, \theta)]/v(m(x_i, \theta)), \tag{13.84}$$

where $\nabla_\theta m(x_i, \theta)$ is the $1 \times P$ gradient of the mean function and, of
importance, $v(\mu)$ is the variance function associated with the chosen LEF density. For
the standard normal, $v(\mu) = 1$, for the Bernoulli, $v(\mu) = \mu(1 - \mu)$, for the
Poisson, $v(\mu) = \mu$, and for the exponential, $v(\mu) = \mu^2$. The structure in
equation (13.84) shows immediately that the QMLE is Fisher consistent: if
$E(y_i \mid x_i) = m(x_i, \theta_o)$, then $E[s_i(\theta_o) \mid x_i] = 0$, which in turn
implies that the unconditional mean of the score is zero. We derived (13.84) explicitly in
the probit example (Example 13.1), and it is also easily seen to be true for the Poisson
regression (Example 13.2). But here we emphasize that $E[s_i(\theta_o) \mid x_i] = 0$ holds
for arbitrary misspecification of these densities provided the mean is correctly
specified.

We can also use the score to compute the expected Hessian conditional on $x_i$:

$$A(x_i, \theta_o) \equiv -E[H_i(\theta_o) \mid x_i] = \nabla_\theta m(x_i, \theta_o)'\nabla_\theta m(x_i, \theta_o)/v(m(x_i, \theta_o)). \tag{13.85}$$

Further,

$$E[s_i(\theta_o)s_i(\theta_o)' \mid x_i] = E(u_i^2 \mid x_i)\nabla_\theta m(x_i, \theta_o)'\nabla_\theta m(x_i, \theta_o)/[v(m(x_i, \theta_o))]^2, \tag{13.86}$$

where $u_i \equiv y_i - m(x_i, \theta_o)$. It follows immediately that the CIME holds if
$E(u_i^2 \mid x_i) = v(m(x_i, \theta_o))$, that is,

$$\mathrm{Var}(y_i \mid x_i) = v(m(x_i, \theta_o)). \tag{13.87}$$

(In the Bernoulli case, $v(m) = m(1 - m)$, and in the Poisson case, $v(m) = m$.) In other
words, if the chosen LEF density has a conditional variance equal to the actual
$\mathrm{Var}(y_i \mid x_i)$, then we can use the usual MLE standard errors and inference
(even if features of the distribution other than the first two conditional moments are
misspecified). So, for example, in a Poisson regression analysis, if
$\mathrm{Var}(y_i \mid x_i) = E(y_i \mid x_i)$ and the mean function is correctly specified,
we can act as if we are using MLE rather than QMLE, even if higher order conditional
moments of $y_i$ do not match up with the Poisson distribution.

If $\mathrm{Var}(y_i \mid x_i)$ is unrestricted, the IME will not hold, and then the robust
sandwich estimator

$$\widehat{\mathrm{Avar}}(\hat\theta) = \left(\sum_{i=1}^N A(x_i, \hat\theta)\right)^{-1}\left(\sum_{i=1}^N s_i(\hat\theta)s_i(\hat\theta)'\right)\left(\sum_{i=1}^N A(x_i, \hat\theta)\right)^{-1} \tag{13.88}$$

should be used, where $A(x_i, \hat\theta)$ is given in (13.85) with $\hat\theta$ replacing
$\theta_o$. Because of the particular structure of the log likelihood and the assumption
that the conditional mean is correctly specified, we can find
$E[H_i(\theta_o) \mid x_i]$. If we did not want to assume correct specification of the
conditional mean, we would be in the setup of Section 13.11.1.
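
To make (13.84), (13.85), and (13.88) concrete, here is a minimal Python sketch of a Poisson QMLE applied to a continuous, nonnegative outcome. It assumes a scalar parameter and the mean $m(x, \theta) = \exp(x\theta)$ without an intercept, so that the score is $x_i[y_i - \exp(x_i\theta)]$ and $A(x_i, \theta) = x_i^2\exp(x_i\theta)$. The function names and the constructed data are hypothetical.

```python
import math

def poisson_qmle(x, y, theta=0.0, tol=1e-12, max_iter=100):
    """Newton iterations for the Poisson QMLE with mean exp(x * theta)."""
    for _ in range(max_iter):
        m = [math.exp(xi * theta) for xi in x]
        score = sum(xi * (yi - mi) for xi, yi, mi in zip(x, y, m))  # sum of s_i
        a_sum = sum(xi * xi * mi for xi, mi in zip(x, m))           # sum of A_i
        step = score / a_sum
        theta += step
        if abs(step) < tol:
            break
    return theta

def qmle_variances(x, y, theta):
    """Return (robust, nonrobust) variance estimates for scalar theta."""
    m = [math.exp(xi * theta) for xi in x]
    a_sum = sum(xi * xi * mi for xi, mi in zip(x, m))
    b_sum = sum((xi * (yi - mi)) ** 2 for xi, yi, mi in zip(x, y, m))
    robust = b_sum / a_sum ** 2   # scalar form of (13.88)
    nonrobust = 1.0 / a_sum       # appropriate only if (13.87) holds
    return robust, nonrobust

# Constructed, continuous outcomes (not counts); purely illustrative.
x = [0.1 * i for i in range(1, 21)]
y = [math.exp(0.5 * xi) * (1.4 if i % 2 == 0 else 0.6) for i, xi in enumerate(x)]

theta_hat = poisson_qmle(x, y)
robust, nonrobust = qmle_variances(x, y, theta_hat)
```

In this constructed data set the squared residuals are small relative to the fitted means, so the robust estimate falls below the nonrobust one; with overdispersed data the ordering would typically reverse.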

QMLE in the LEF is closely related to the so-called generalized linear model 
(GLM) literature in statistics. The terminology and some particulars differ, and the 
early GLM literature did not recognize the robustness of the approach for estimating 
conditional mean parameters, but in modern applications the key feature is that they 
both use QMLE to estimate parameters of a conditional mean. The GLM approach 
is more restrictive in that the conditional mean is assumed to have an index structure. 
In particular, the mean is assumed to have the form $m(x, \theta) = r(x\theta)$, where the
“index” $x\theta$ is linear in parameters and $r(\cdot)$ is a function of the index. In
addition, an important component of the GLM apparatus is the link function, which
implicitly defines the mean function. If we let $\eta$ denote the index $x\theta$, then the
link function $g(\cdot)$ is such that $\eta = g(\mu)$. The link function is strictly
monotonic and therefore has an inverse, and so $\mu = g^{-1}(\eta)$ or, in the notation of
conditional mean functions, $m(x, \theta) = g^{-1}(x\theta)$.

The term “generalized linear model” comes from the underlying linearity of the 
index function, and then the link function introduces nonlinearity. In most applica- 
tions, it is more natural to specify the conditional mean function because we want the 
mean function to be consistent with the nature of y;, and y; is the outcome we hope 
to explain. (So, for example, if $y_i$ is nonnegative, we want $m(x, \theta)$ to be positive
for all $x$ and $\theta$; if $0 < y_i < 1$, we want $m(x, \theta)$ to be in the unit
interval.) Once the mean function is specified, we use a suitable LEF density; this is the
approach taken by GMT (1984a). Directly specifying $m(x, \theta)$ does not wed one to the
index structure, although in most applications, $m(x, \theta)$ has an index form. If, say,
$m(x, \theta) = \exp(x\theta)$ then the link function is $g(\mu) = \log(\mu)$ for $\mu > 0$.
If $m(x, \theta) = \exp(x\theta)/[1 + \exp(x\theta)]$, then
$g(\mu) = \log[\mu/(1 - \mu)]$ for $0 < \mu < 1$. McCullagh and Nelder (1989) is a good
reference for GLM. 
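
The relationship between a link function and the implied mean function can be written out directly. The Python sketch below is only illustrative; the dictionary `LINKS` and the function `mean_function` are hypothetical names, and only the log and logit links discussed above are included.

```python
import math

# Each entry pairs a link g(mu) with its inverse g^{-1}(eta).
LINKS = {
    "log": (lambda mu: math.log(mu),
            lambda eta: math.exp(eta)),
    "logit": (lambda mu: math.log(mu / (1 - mu)),
              lambda eta: math.exp(eta) / (1 + math.exp(eta))),
}

def mean_function(x, theta, link):
    """Compute m(x, theta) = g^{-1}(x theta) for a row vector x."""
    eta = sum(xj * tj for xj, tj in zip(x, theta))   # the linear index
    _, inverse = LINKS[link]
    return inverse(eta)

x_row = [1.0, 2.0]
theta = [0.5, -0.25]          # index is 0.5 - 0.5 = 0
m_log = mean_function(x_row, theta, "log")
m_logit = mean_function(x_row, theta, "logit")
```

Applying the link to either result recovers the index, which is the sense in which $g(\cdot)$ implicitly defines the mean function.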

The GLM literature recognized early on that assumption (13.87) was too restrictive
for many applications. As we discussed, no variance assumption is needed for
consistent estimation of $\theta_o$. An assumption that has been used in the GLM literature
allows $\mathrm{Var}(y_i \mid x_i)$ to differ from that implied by the LEF distribution by a
constant:

$$\mathrm{Var}(y_i \mid x_i) = \sigma_o^2 v(m(x_i, \theta_o)) \tag{13.89}$$

for some $\sigma_o^2 > 0$, which is often called the dispersion parameter. Because of its
historical role in GLM analysis, we refer to (13.89) as the GLM variance assumption.
When $\sigma_o^2 > 1$, then we say there is overdispersion (relative to the chosen density);
underdispersion is when $\sigma_o^2 < 1$, and both cases arise in practice.

Under (13.89), it is straightforward to estimate $\sigma_o^2$. Let
$u_i = y_i - m(x_i, \theta_o)$ be the additive “errors.” Then, because
$E(u_i^2 \mid x_i) = \sigma_o^2 v(m(x_i, \theta_o)) \equiv \sigma_o^2 v_i$, it follows by
iterated expectations that

$$E(u_i^2/v_i) = E[E(u_i^2/v_i \mid x_i)] = E[E(u_i^2 \mid x_i)/v_i] = E(\sigma_o^2 v_i/v_i) = \sigma_o^2. \tag{13.90}$$

Therefore, by the usual analogy principle argument,

$$\hat\sigma^2 = (N - P)^{-1}\sum_{i=1}^N \hat u_i^2/\hat v_i \tag{13.91}$$

is consistent for $\sigma_o^2$, where $\hat u_i = y_i - m(x_i, \hat\theta)$ are the residuals,
$\hat v_i = v(m(x_i, \hat\theta))$ are the estimated conditional variances from the LEF
density, and the degrees-of-freedom adjustment is common (but, of course, does not affect
consistency). In the GLM literature, the standardized residuals
$\hat u_i/\sqrt{\hat v_i}$ are called the Pearson residuals
and the estimate in equation (13.91) is the Pearson dispersion estimator.
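
The Pearson dispersion estimator in (13.91) is simple enough to write out by hand. The Python sketch below is illustrative only: the function name `pearson_dispersion` and the toy outcomes and fitted means are hypothetical, and the fitted means are taken as given rather than estimated here.

```python
def pearson_dispersion(y, mu_hat, var_fn, n_params):
    """Sum of squared Pearson residuals divided by N - P, as in (13.91).

    var_fn is the variance function v(.) of the chosen LEF density,
    for example v(mu) = mu for the Poisson case.
    """
    n = len(y)
    squared_pearson = [(yi - mi) ** 2 / var_fn(mi) for yi, mi in zip(y, mu_hat)]
    return sum(squared_pearson) / (n - n_params)

# Hypothetical outcomes and fitted means from a one-parameter Poisson QMLE.
y = [0.0, 2.0, 4.0, 6.0]
mu_hat = [1.0, 2.0, 3.0, 4.0]
sigma2_hat = pearson_dispersion(y, mu_hat, var_fn=lambda mu: mu, n_params=1)
```

A value above one would point toward overdispersion relative to the Poisson variance and a value below one toward underdispersion, although with four observations the number carries no real information.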

Under the GLM variance assumption, it is easily seen that the generalized IME,
given in equation (12.53), is satisfied (once we account for the minor difference
between a minimization and maximization problem). In fact, a conditional version
holds, which we can call the generalized conditional information matrix equality
(GCIME):

$$E[s_i(\theta_o)s_i(\theta_o)' \mid x_i] = -\sigma_o^2E[H_i(\theta_o) \mid x_i] = \sigma_o^2A(x_i, \theta_o). \tag{13.92}$$

From Chapter 12, we can take

$$\widehat{\mathrm{Avar}}(\hat\theta) = \hat\sigma^2\left(\sum_{i=1}^N A(x_i, \hat\theta)\right)^{-1} = \hat\sigma^2\left(\sum_{i=1}^N \nabla_\theta\hat m_i'\nabla_\theta\hat m_i/\hat v_i\right)^{-1}, \tag{13.93}$$

where the notation should be clear. Most software packages that have GLM
commands (typically requiring one to specify the LEF density and the link
function) allow computation of the nonrobust variance matrix estimator that
assumes (13.87) (which is the same variance estimate under full MLE), the estimator
in (13.93), or the fully robust form in (13.88). Therefore, the GLM framework can
be used to obtain QMLEs for the LEF for a certain class of mean functions.

Notice how similar the structure of equation (13.93) is to the asymptotic variance
of the weighted nonlinear least squares estimator from Chapter 12; see equation
(12.59). In fact, the only difference between (13.93) and (12.59) is that with WNLS we
allow the parameters in the variance function to be different from $\theta_o$, and,
therefore, to be estimated in a separate stage (usually after an initial NLS estimation).
The similarity between (13.93) and (12.59) hints at a close link between QMLEs in the
LEF and WNLS. In fact, for every QMLE in the LEF, there is an asymptotically
equivalent WNLS estimator when the conditional mean is correctly specified: for
WNLS, take $h(x, \gamma) = v(m(x, \gamma))$ and $\hat\gamma$ an initial
$\sqrt{N}$-consistent estimator of $\theta_o$ (such as NLS). The QMLE and WNLS estimators
will be $\sqrt{N}$-equivalent whether or not (13.89) holds.

If we nominally wish to assume (13.89), QMLE is computationally convenient 
because it is a one-step estimation procedure, and the objective function is usually 
well behaved; WNLS requires two steps. Therefore, when the variance and mean are
thought to be related in a way suggested by an LEF density (more precisely, its
GLM extension in (13.89)), QMLE is almost always used. And, of course, we can
make inference fully robust to any variance-mean relationship by using the sandwich 
estimator (13.88). 


13.11.4 Generalized Estimating Equations for Panel Data 


Naturally, we can extend QMLE in the LEF to panel data. The simplest approach
is to specify conditional mean functions $m_t(x_t, \theta)$, $t = 1, \ldots, T$, along with
an LEF density, and then to proceed with estimation by ignoring any time dependence. The
pooled quasi-likelihood has the same form as equation (13.45), where
$f_t(y_t \mid x_t; \theta)$ is in the LEF. (Allowing the mean and density functions to
depend on $t$ is probably not necessary, but it reminds us that we often want to allow
some parameters to change over $t$, for example,
$m_t(x_t, \theta) = \exp(\alpha_t + x_t\beta)$. In practice, we would accomplish
different “intercepts” within the exponential function by including a full set of time
period dummies among the regressors.)

Correct specification of the mean for each $t$ means that, for some $\theta_o$,

$$E(y_{it} \mid x_{it}) = m_t(x_{it}, \theta_o), \quad t = 1, \ldots, T. \tag{13.94}$$

Notice that (13.94) does not assume strict exogeneity of $\{x_{it} : t = 1, \ldots, T\}$. The
score for each $t$ has the same form as equation (13.84):

$$s_{it}(\theta) = \nabla_\theta m_t(x_{it}, \theta)'[y_{it} - m_t(x_{it}, \theta)]/v(m_t(x_{it}, \theta)), \tag{13.95}$$

and the partial QMLE (or pooled QMLE) is generally found by solving
$\sum_{i=1}^N\sum_{t=1}^T s_{it}(\hat\theta) = 0$.

As we discussed in Section 13.8 for partial MLE, the scores
$\{s_{it}(\theta_o) : t = 1, \ldots, T\}$ are generally serially correlated. However, in
Section 13.8.3 we saw an important case in which the scores are not serially correlated:
the distribution $D(y_{it} \mid x_{it})$ is dynamically complete in the sense that it also
equals $D(y_{it} \mid x_{it}, y_{i,t-1}, x_{i,t-1}, \ldots, y_{i1}, x_{i1})$.
Here, we are not assuming that we have a full density $f_t(y_t \mid x_t; \theta)$ correctly
specified. Nevertheless, using equation (13.95), we can easily see that if the conditional
mean is dynamically complete in the sense that

$$E(y_{it} \mid x_{it}) = E(y_{it} \mid x_{it}, y_{i,t-1}, x_{i,t-1}, \ldots, y_{i1}, x_{i1}), \tag{13.96}$$

then

$$E[s_{it}(\theta_o) \mid x_{it}, y_{i,t-1}, x_{i,t-1}, \ldots, y_{i1}, x_{i1}] = 0, \tag{13.97}$$

and so the scores evaluated at $\theta_o$ are necessarily serially uncorrelated. Importantly,
this finding has nothing to do with whether other features of the LEF density,
such as the conditional variance, are correctly specified. Without any assumptions on
$\mathrm{Var}(y_{it} \mid x_{it})$, the appropriate asymptotic variance estimator of the
pooled QMLE is

$$\left(\sum_{i=1}^N\sum_{t=1}^T \nabla_\theta\hat m_{it}'\nabla_\theta\hat m_{it}/\hat v_{it}\right)^{-1}\left(\sum_{i=1}^N\sum_{t=1}^T \hat u_{it}^2\nabla_\theta\hat m_{it}'\nabla_\theta\hat m_{it}/\hat v_{it}^2\right)\left(\sum_{i=1}^N\sum_{t=1}^T \nabla_\theta\hat m_{it}'\nabla_\theta\hat m_{it}/\hat v_{it}\right)^{-1}$$

where the notation should be clear. This estimator is precisely what is reported from 
a pooled QMLE analysis—in most cases using a GLM routine—that ignores the 
time dimension but allows the variance to be misspecified. 

If we further add the GLM variance assumption
$\mathrm{Var}(y_{it} \mid x_{it}) = \sigma_o^2 v(m_t(x_{it}, \theta_o))$ for all $t$
(including that the scale factor $\sigma_o^2$ is constant across $t$), then the pooled
analogue of equation (13.93) is valid (where $\hat\sigma^2$ is obtained from the sum of
squared standardized residuals across $i$ and $t$). Some econometrics packages allow
computation of these variances as a routine matter, and they also compute the variance
matrix that does not assume dynamic completeness in the conditional mean.
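
The difference between the variance estimate that relies on serially uncorrelated scores and the one that does not can be seen in a scalar-parameter Python sketch. The function name `pooled_qmle_variances` and the toy inputs are hypothetical; `scores[i][t]` and `a_terms[i][t]` are assumed to hold the estimated score and the estimated expected Hessian term for unit $i$ in period $t$.

```python
def pooled_qmle_variances(scores, a_terms):
    """Return (no_serial_corr, fully_robust) variance estimates, scalar theta.

    The first sums squared scores over all (i, t) pairs, which is valid when
    the scores are serially uncorrelated. The second sums scores within each
    unit before squaring, which allows arbitrary serial correlation.
    """
    a_sum = sum(sum(row) for row in a_terms)
    b_pooled = sum(s * s for row in scores for s in row)
    b_cluster = sum(sum(row) ** 2 for row in scores)
    return b_pooled / a_sum ** 2, b_cluster / a_sum ** 2

# Hypothetical estimated scores and Hessian terms for N = 3 units, T = 2 periods.
scores = [[1.0, -1.0], [2.0, 1.0], [-1.0, -2.0]]
a_terms = [[1.0, 1.0], [2.0, 2.0], [1.0, 1.0]]
v_pooled, v_robust = pooled_qmle_variances(scores, a_terms)
```

Here the second estimate is larger because the scores for two of the three units have the same sign in both periods, which is the pattern positive serial correlation would produce.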

Naturally, if the mean is not dynamically complete, then, if we make the stronger 
assumption of strict exogeneity of the regressors, it is possible to obtain a more effi- 
cient estimator. In Section 12.9, we studied the multivariate WNLS estimator, and 
noted that a common way to choose a nonconstant “working” variance matrix was 
to specify models for the conditional variances—which may be misspecified—along 
with a constant “working” correlation matrix. This is common in the generalized 
estimating equations (GEE) approach to panel data models. For practical purposes, 
GEE is a special case of WMNLS, where the variance functions are chosen from the 
LEF of distributions (and the mean functions are commonly chosen from linear, ex- 
ponential, logistic, and probit, as in standard GLM analysis). 

For the $T \times 1$ vector $y_i$, we assume

$$E(y_i \mid x_i) = m(x_i, \theta_o), \quad \text{some } \theta_o \in \Theta, \tag{13.98}$$

where $x_i$ is the collection of all regressors across all time periods and $m(x, \theta)$ is
$T \times 1$. Notice that the vector $E(y_i \mid x_i)$ has elements $E(y_{it} \mid x_i)$,
which means that the regressors are strictly exogenous. When the mean function at time $t$
does not depend on all of $x_i$, (13.98) implies that the regressors excluded at time $t$
can have no partial effect on $E(y_{it} \mid x_i)$. An important case where all elements of
$x_i$ appear for each $t$ is in a correlated random coefficient setup, but then
restrictions are obtained using a strict exogeneity assumption conditional on the
heterogeneity. We return to this situation below. Generally, one should not apply GEE
methods with a nondiagonal working variance matrix (see Section 12.9) unless the
regressors satisfy a strict exogeneity assumption.

In the most common applications of GEE, we specify, for each $t$, a quasi-log
likelihood from the LEF. As in the cross section case, this choice is motivated by the
nature of $y_{it}$. Assuming we have a suitable parametric model $m_t(x_{it}, \theta)$ for the
mean, and have chosen an LEF, GEE analysis is straightforward once we choose the
working correlation matrix (see Section 12.9). Typically, this correlation matrix is
constant, depending on parameters $\rho$, so we can write $R(\rho)$ as the $T \times T$
matrix of proposed correlations. Because we do not allow this matrix to depend on $x_i$,
there can be no presumption that $\mathrm{Corr}(y_{it}, y_{is} \mid x_i) = r_{ts}(\rho)$. A
key feature of GEE is that we explicitly recognize that the conditional correlations might
not be constant (in fact, often we know almost certainly they are not) but we apply
weighted multivariate nonlinear least squares anyway to hopefully improve efficiency over
just a pooled QMLE analysis.

As mentioned in Section 12.9, the two most common working correlation matrices
for panel data are the unstructured and the exchangeable. The latter conserves on
parameters in the working variance matrix, but with large $N$ (and not so large $T$)
parsimony might not be necessary. In any case, we need to combine a working
correlation matrix with a nominal assumption about the variances. The variances
naturally come from the chosen LEF density. So, if we are using the Bernoulli
quasi-log likelihood with mean function $m_t(\mathbf{x}_{it}, \boldsymbol{\theta})$, then we use the nominal variance
$m_t(\mathbf{x}_{it}, \boldsymbol{\theta})[1 - m_t(\mathbf{x}_{it}, \boldsymbol{\theta})]$. For the Poisson QLL, we use $m_t(\mathbf{x}_{it}, \boldsymbol{\theta})$. The mean parameters
are estimated by pooled QMLE in a first stage, say $\check{\boldsymbol{\theta}}$. Then the working correlation
matrix can be estimated using standardized residuals $\check{u}_{it}/\sqrt{\check{v}_{it}}$ (see Section
12.9). Then we can form the working variance matrix, $\hat{\mathbf{W}}_i = \hat{\mathbf{V}}_i^{1/2} \hat{\mathbf{R}} \hat{\mathbf{V}}_i^{1/2}$, where
$\hat{\mathbf{V}}_i = \mathrm{diag}(v_{i1}(\check{\boldsymbol{\theta}}), \ldots, v_{iT}(\check{\boldsymbol{\theta}}))$ depends on the first-step estimates. We then apply WMNLS,
or work off the first-order conditions. It is important to use the asymptotic variance
matrix in equation (12.98) that allows the working variance matrix to be misspecified,
whether it is due to the variances being misspecified or the working correlation matrix
being misspecified. (As we discussed in Section 12.9, equation (12.98) is not valid
if the conditional mean is misspecified, and the GEE literature sometimes refers
to this estimate as a "semirobust" estimate.) If one believes the variance matrix is
correct up to a dispersion factor, then $\hat{\sigma}^2$ is obtained along with other parameter
estimates.
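
The construction of the working variance matrix can be sketched numerically. The following is a minimal illustration, not a full GEE routine: it assumes a Poisson nominal variance, an exchangeable working correlation, and a known scalar `theta_first_stage` standing in for the pooled QMLE estimate. All function names are hypothetical, and the estimator of the common correlation omits any degrees-of-freedom adjustment.

```python
import numpy as np

def exchangeable_corr(T, rho):
    """T x T exchangeable working correlation: ones on the diagonal, rho elsewhere."""
    return (1.0 - rho) * np.eye(T) + rho * np.ones((T, T))

def estimate_exchangeable_rho(std_resid):
    """Average product of standardized residuals over all ordered pairs t != s.

    std_resid is an N x T array of first-stage residuals divided by the
    square root of the nominal variance.
    """
    N, T = std_resid.shape
    row_sums = std_resid.sum(axis=1)
    # (sum_t e_t)^2 - sum_t e_t^2 equals the sum over t != s of e_t * e_s.
    cross = row_sums**2 - (std_resid**2).sum(axis=1)
    return cross.sum() / (N * T * (T - 1))

def working_variance(nominal_var, R):
    """W_i = V_i^{1/2} R V_i^{1/2} for one unit; nominal_var has length T."""
    root = np.sqrt(nominal_var)
    return root[:, None] * R * root[None, :]

# Illustration with a Poisson nominal variance v_it = m_it = exp(x_it * theta).
rng = np.random.default_rng(0)
N, T = 500, 4
x = rng.normal(size=(N, T))
theta_first_stage = 0.3          # assumption: stands in for the pooled QMLE estimate
mean = np.exp(theta_first_stage * x)
y = rng.poisson(mean)
std_resid = (y - mean) / np.sqrt(mean)
rho_hat = estimate_exchangeable_rho(std_resid)
R_hat = exchangeable_corr(T, rho_hat)
W_0 = working_variance(mean[0], R_hat)   # working variance matrix for unit i = 0
```

By construction the diagonal of `W_0` reproduces the nominal variances, while the off-diagonal terms are scaled by the single estimated correlation. Because the simulated counts are independent over time, `rho_hat` should be close to zero here.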

Pooled QMLE and GEE methods are attractive for conditional mean models with
unobserved heterogeneity, say $c_i$, and strictly exogenous regressors conditional on
$c_i$: $E(y_{it} \mid \mathbf{x}_{i1}, \ldots, \mathbf{x}_{iT}, c_i) = E(y_{it} \mid \mathbf{x}_{it}, c_i)$. When we combine this assumption with
a specific correlated RE assumption, such as $c_i = \psi + \bar{\mathbf{x}}_i \boldsymbol{\xi} + a_i$, where $\bar{\mathbf{x}}_i$ is the vector
of time averages and $a_i$ is independent of $(\mathbf{x}_{i1}, \ldots, \mathbf{x}_{iT})$, then we can often find
$E(y_{it} \mid \mathbf{x}_{i1}, \ldots, \mathbf{x}_{iT})$ as a function of $(\mathbf{x}_{it}, \bar{\mathbf{x}}_i)$ (or simply assert that it takes on a
convenient form). GEE (including pooled QMLE) becomes an indispensable tool to
ensure our inference is robust to misspecification of the serial correlation structure we
adopt. We will not provide a general treatment now, but we draw heavily on this idea
in several chapters in Part IV.


Problems 


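13.1. If $f(\mathbf{y} \mid \mathbf{x}; \boldsymbol{\theta})$ is a correctly specified model for the density of $\mathbf{y}_i$ given $\mathbf{x}_i$, does $\boldsymbol{\theta}_o$
solve $\max_{\boldsymbol{\theta} \in \Theta} E[f(\mathbf{y}_i \mid \mathbf{x}_i; \boldsymbol{\theta})]$?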


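13.2. Suppose that for a random sample, $y_i \mid \mathbf{x}_i \sim \mathrm{Normal}[m(\mathbf{x}_i, \boldsymbol{\beta}_o), \sigma_o^2]$, where
$m(\mathbf{x}, \boldsymbol{\beta})$ is a function of the $K$-vector of explanatory variables $\mathbf{x}$ and the $P \times 1$ parameter
vector $\boldsymbol{\beta}$. Recall that $E(y_i \mid \mathbf{x}_i) = m(\mathbf{x}_i, \boldsymbol{\beta}_o)$ and $\mathrm{Var}(y_i \mid \mathbf{x}_i) = \sigma_o^2$.

a. Write down the conditional log-likelihood function for observation $i$. Show that
the CMLE of $\boldsymbol{\beta}_o$, $\hat{\boldsymbol{\beta}}$, solves the problem $\min_{\boldsymbol{\beta}} \sum_{i=1}^N [y_i - m(\mathbf{x}_i, \boldsymbol{\beta})]^2$. In other words,
the CMLE for $\boldsymbol{\beta}_o$ is the nonlinear least squares estimator.

b. Let $\boldsymbol{\theta} \equiv (\boldsymbol{\beta}', \sigma^2)'$ denote the $(P + 1) \times 1$ vector of parameters. Find the score of
the log likelihood for a generic $i$. Show directly that $E[\mathbf{s}_i(\boldsymbol{\theta}_o) \mid \mathbf{x}_i] = \mathbf{0}$. What features
of the normal distribution do you need to be correctly specified in order to show that
the conditional expectation of the score is zero?

c. Use the first-order condition to find $\hat{\sigma}^2$ in terms of $\hat{\boldsymbol{\beta}}$.

d. Find the Hessian of the log-likelihood function with respect to $\boldsymbol{\theta}$.

e. Show directly that $-E[\mathbf{H}_i(\boldsymbol{\theta}_o) \mid \mathbf{x}_i] = E[\mathbf{s}_i(\boldsymbol{\theta}_o)\mathbf{s}_i(\boldsymbol{\theta}_o)' \mid \mathbf{x}_i]$.

f. Propose an estimated asymptotic variance of $\hat{\boldsymbol{\beta}}$, and explain how to obtain the
asymptotic standard errors.

13.3. Consider a general binary response model $P(y_i = 1 \mid \mathbf{x}_i) = G(\mathbf{x}_i, \boldsymbol{\theta}_o)$, where
$G(\mathbf{x}, \boldsymbol{\theta})$ is strictly between zero and one for all $\mathbf{x}$ and $\boldsymbol{\theta}$. Here, $\mathbf{x}$ and $\boldsymbol{\theta}$ need not have
the same dimension; let $\mathbf{x}$ be a $K$-vector and $\boldsymbol{\theta}$ a $P$-vector.

a. Write down the log likelihood for observation $i$.

b. Find the score for each $i$. Show directly that $E[\mathbf{s}_i(\boldsymbol{\theta}_o) \mid \mathbf{x}_i] = \mathbf{0}$.

c. When $G(\mathbf{x}, \boldsymbol{\theta}) = \Phi[\mathbf{x}\boldsymbol{\beta} + \delta_1(\mathbf{x}\boldsymbol{\beta})^2 + \delta_2(\mathbf{x}\boldsymbol{\beta})^3]$, find the LM statistic for testing $H_0$:
$\delta_{o1} = 0$, $\delta_{o2} = 0$.

d. How would you compute a variable addition version of the test in part c?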


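13.4. In the Newey-Tauchen-White specification-testing context, explain why we
can take $\mathbf{g}(\mathbf{w}, \boldsymbol{\theta}) = a(\mathbf{x}, \boldsymbol{\theta})\mathbf{s}(\mathbf{w}, \boldsymbol{\theta})$, where $a(\mathbf{x}, \boldsymbol{\theta})$ is essentially any scalar function of $\mathbf{x}$
and $\boldsymbol{\theta}$.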


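13.5. In the context of CMLE, consider a reparameterization of the kind in Section
12.6.2: $\boldsymbol{\phi} = \mathbf{g}(\boldsymbol{\theta})$, where the Jacobian of $\mathbf{g}$, $\mathbf{G}(\boldsymbol{\theta})$, is continuous and nonsingular for all
$\boldsymbol{\theta} \in \Theta$. Let $\mathbf{s}_i^g(\boldsymbol{\phi}) = \mathbf{s}_i^g[\mathbf{g}(\boldsymbol{\theta})]$ denote the score of the log likelihood in the reparameterized
model; thus, from Section 12.6.2, $\mathbf{s}_i^g(\boldsymbol{\phi}) = [\mathbf{G}(\boldsymbol{\theta})']^{-1}\mathbf{s}_i(\boldsymbol{\theta})$.

a. Using the conditional information matrix equality, find $\mathbf{A}_i^g(\boldsymbol{\phi}_o) \equiv
E[\mathbf{s}_i^g(\boldsymbol{\phi}_o)\mathbf{s}_i^g(\boldsymbol{\phi}_o)' \mid \mathbf{x}_i]$ in terms of $\mathbf{G}(\boldsymbol{\theta}_o)$ and $\mathbf{A}_i(\boldsymbol{\theta}_o) \equiv E[\mathbf{s}_i(\boldsymbol{\theta}_o)\mathbf{s}_i(\boldsymbol{\theta}_o)' \mid \mathbf{x}_i]$.

b. Show that $\tilde{\mathbf{A}}_i^g = \tilde{\mathbf{G}}'^{-1}\tilde{\mathbf{A}}_i\tilde{\mathbf{G}}^{-1}$, where these are all evaluated at the restricted
estimate, $\tilde{\boldsymbol{\theta}}$.

c. Use part b to show that the expected Hessian form of the LM statistic is invariant
to reparameterization.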


13.6. Suppose that for a panel data set with $T$ time periods, $y_{it}$ given $\mathbf{x}_{it}$ has a
Poisson distribution with mean $\exp(\mathbf{x}_{it}\boldsymbol{\theta}_o)$, $t = 1, \ldots, T$.

a. Do you have enough information to construct the joint distribution of $\mathbf{y}_i$ given $\mathbf{x}_i$?
Explain.

b. Write down the partial log likelihood for each $i$ and find the score, $\mathbf{s}_i(\boldsymbol{\theta})$.

c. Show how to estimate $\mathrm{Avar}(\hat{\boldsymbol{\theta}})$; it should be of the form (13.53).

d. How does the estimator of $\mathrm{Avar}(\hat{\boldsymbol{\theta}})$ simplify if the conditional mean is dynamically
complete?


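13.7. Suppose that you have two parametric models for conditional densities:
$g(y_1 \mid y_2, \mathbf{x}; \boldsymbol{\theta})$ and $h(y_2 \mid \mathbf{x}; \boldsymbol{\theta})$; not all elements of $\boldsymbol{\theta}$ need to appear in both densities.
Denote the true value of $\boldsymbol{\theta}$ by $\boldsymbol{\theta}_o$.

a. What is the joint density of $(y_1, y_2)$ given $\mathbf{x}$? How would you estimate $\boldsymbol{\theta}_o$ given a
random sample on $(\mathbf{x}, y_1, y_2)$?

b. Suppose now that a random sample is not available on all variables. In particular,
$y_1$ is observed only when $(\mathbf{x}, y_2)$ satisfies a known rule. For example, when $y_2$ is
binary, $y_1$ is observed only when $y_2 = 1$. We assume $(\mathbf{x}, y_2)$ is always observed. Let
$r_2$ be a binary variable equal to one if $y_1$ is observed and zero otherwise. A partial
MLE is obtained by defining

$$\ell_i(\boldsymbol{\theta}) = r_{i2} \log g(y_{i1} \mid y_{i2}, \mathbf{x}_i; \boldsymbol{\theta}) + \log h(y_{i2} \mid \mathbf{x}_i; \boldsymbol{\theta}) \equiv r_{i2}\ell_{i1}(\boldsymbol{\theta}) + \ell_{i2}(\boldsymbol{\theta})$$

for each $i$. This formulation ensures that the first part of $\ell_i$ only enters the estimation
when $y_{i1}$ is observed. Verify that $\boldsymbol{\theta}_o$ maximizes $E[\ell_i(\boldsymbol{\theta})]$ over $\Theta$.

c. Show that $-E[\mathbf{H}_i(\boldsymbol{\theta}_o)] = E[\mathbf{s}_i(\boldsymbol{\theta}_o)\mathbf{s}_i(\boldsymbol{\theta}_o)']$, even though the problem is not a true
conditional MLE problem (and therefore a conditional information matrix equality
does not hold).

d. Argue that a consistent estimator of $\mathrm{Avar}\,\sqrt{N}(\hat{\boldsymbol{\theta}} - \boldsymbol{\theta}_o)$ is

$$\left[ N^{-1} \sum_{i=1}^N (r_{i2}\hat{\mathbf{A}}_{i1} + \hat{\mathbf{A}}_{i2}) \right]^{-1},$$

where $\mathbf{A}_{i1}(\boldsymbol{\theta}_o) = -E[\nabla_\theta^2 \ell_{i1}(\boldsymbol{\theta}_o) \mid y_{i2}, \mathbf{x}_i]$, $\mathbf{A}_{i2}(\boldsymbol{\theta}_o) = -E[\nabla_\theta^2 \ell_{i2}(\boldsymbol{\theta}_o) \mid \mathbf{x}_i]$, and $\hat{\boldsymbol{\theta}}$ replaces
$\boldsymbol{\theta}_o$ in obtaining the estimates.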


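13.8. Suppose that for random vectors $\mathbf{y}_i$, $\mathbf{x}_i$, and $\mathbf{w}_i$,

$$D(\mathbf{y}_i \mid \mathbf{x}_i, \mathbf{w}_i) = D(\mathbf{y}_i \mid \mathbf{x}_i, \mathbf{g}(\mathbf{w}_i, \boldsymbol{\gamma}_o)) \equiv D(\mathbf{y}_i \mid \mathbf{x}_i, \mathbf{g}_i),$$

where $\mathbf{g}(\mathbf{w}, \boldsymbol{\gamma})$ is a known function of $\mathbf{w}$ and the $J$ parameters $\boldsymbol{\gamma}$, and $\mathbf{g}_i \equiv \mathbf{g}(\mathbf{w}_i, \boldsymbol{\gamma}_o)$
(which is unobserved because we generally do not know $\boldsymbol{\gamma}_o$). Assume that we have a
correctly specified density for $D(\mathbf{y}_i \mid \mathbf{x}_i, \mathbf{g}_i)$, $f(\mathbf{y} \mid \mathbf{x}, \mathbf{g}; \boldsymbol{\theta})$, and let the $P \times 1$ vector $\boldsymbol{\theta}_o$
denote the true value. Let $\hat{\boldsymbol{\gamma}}$ be a $\sqrt{N}$-asymptotically normal estimator of $\boldsymbol{\gamma}_o$ that
satisfies

$$\sqrt{N}(\hat{\boldsymbol{\gamma}} - \boldsymbol{\gamma}_o) = N^{-1/2} \sum_{i=1}^N \mathbf{r}(\mathbf{w}_i, \boldsymbol{\gamma}_o) + o_p(1),$$

and let $\hat{\boldsymbol{\theta}}$ be the two-step nonlinear least squares estimator that solves

$$\max_{\boldsymbol{\theta} \in \Theta} \sum_{i=1}^N \log f(\mathbf{y}_i \mid \mathbf{x}_i, \mathbf{g}(\mathbf{w}_i, \hat{\boldsymbol{\gamma}}); \boldsymbol{\theta}).$$

a. Show that, under standard regularity conditions, the asymptotic variance of
$\sqrt{N}(\hat{\boldsymbol{\theta}} - \boldsymbol{\theta}_o)$ is at least as large as the asymptotic variance if we knew $\boldsymbol{\gamma}_o$.

b. Propose a consistent estimator of $\mathrm{Avar}\,\sqrt{N}(\hat{\boldsymbol{\theta}} - \boldsymbol{\theta}_o)$.

c. Suppose that $y_i$ is a scalar and $P(y_i = 1 \mid \mathbf{x}_i, \mathbf{w}_i) = \Phi(\mathbf{x}_i\boldsymbol{\beta}_o + \rho_o g_i)$, where $\mathbf{w}_i =
(\mathbf{z}_i, h_i)$, $g_i = h_i - \mathbf{z}_i\boldsymbol{\gamma}_o$, and $E(\mathbf{z}_i' g_i) = \mathbf{0}$. Let $\hat{\boldsymbol{\theta}} = (\hat{\boldsymbol{\beta}}', \hat{\rho})'$ be the two-step probit
estimators from probit of $y_i$ on $\mathbf{x}_i$ and $\hat{g}_i$, where $\hat{g}_i = h_i - \mathbf{z}_i\hat{\boldsymbol{\gamma}}$ are OLS residuals from $h_i$
on $\mathbf{z}_i$, $i = 1, \ldots, N$. How does part a apply in this case?

d. Show that, when $\rho_o = 0$, the usual probit variance matrix estimator of $\hat{\boldsymbol{\theta}}$ is
asymptotically valid. That is, valid inference is obtained by ignoring the first-stage
estimation of $\boldsymbol{\gamma}_o$.

e. How would you test $H_0: \rho_o = 0$?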


13.9. Let $\{y_t: t = 0, 1, \ldots, T\}$ be an observable time series representing a population,
where we use the convention that $t = 0$ is the first time period for which $y$ is
observed. Assume that the sequence follows a Markov process: $D(y_t \mid y_{t-1}, y_{t-2}, \ldots, y_0)
= D(y_t \mid y_{t-1})$ for all $t \geq 1$. Let $f_t(y_t \mid y_{t-1}; \boldsymbol{\theta})$ denote a correctly specified model for
the density of $y_t$ given $y_{t-1}$, $t \geq 1$, where $\boldsymbol{\theta}_o$ is the true value of $\boldsymbol{\theta}$.

a. Show that, to obtain the joint distribution of $(y_0, y_1, \ldots, y_T)$, you need to correctly
model the density of $y_0$.

b. Given a random sample of size $N$ from the population, that is, $(y_{i0}, y_{i1}, \ldots, y_{iT})$
for each $i$, explain how to consistently estimate $\boldsymbol{\theta}_o$ without modeling $D(y_0)$.

c. How would you estimate the asymptotic variance of the estimator from part b? Be
specific.


13.10. Let $\mathbf{y}$ be a $G \times 1$ random vector with elements $y_g$, $g = 1, 2, \ldots, G$. These
could be different response variables for the same cross section unit or responses at
different points in time. Let $\mathbf{x}$ be a $K$-vector of observed conditioning variables, and
let $c$ be an unobserved conditioning variable. Let $f_g(\cdot \mid \mathbf{x}, c)$ denote the density of $y_g$
given $(\mathbf{x}, c)$. Further, assume that $y_1, y_2, \ldots, y_G$ are independent conditional on
$(\mathbf{x}, c)$.

a. Write down the joint density of $\mathbf{y}$ given $(\mathbf{x}, c)$.

b. Let $h(\cdot \mid \mathbf{x})$ be the density of $c$ given $\mathbf{x}$. Find the joint density of $\mathbf{y}$ given $\mathbf{x}$.

c. If each $f_g(\cdot \mid \mathbf{x}, c)$ is known up to a $P_g$-vector of parameters $\boldsymbol{\gamma}_o^g$ and $h(\cdot \mid \mathbf{x})$ is known
up to an $M$-vector $\boldsymbol{\delta}_o$, find the log likelihood for any random draw $(\mathbf{x}_i, \mathbf{y}_i)$ from the
population.

d. Is there a relationship between this setup and a linear SUR model?

13.11. Consider the dynamic, linear unobserved effects model

$$y_{it} = \rho y_{i,t-1} + c_i + e_{it}, \quad t = 1, 2, \ldots, T$$
$$E(e_{it} \mid y_{i,t-1}, y_{i,t-2}, \ldots, y_{i0}, c_i) = 0$$

In Section 11.6.2 we discussed estimation of $\rho$ by instrumental variables methods
after differencing. The deficiencies of the IV approach for large $\rho$ may be overcome
by applying the conditional MLE methods in Section 13.9.2.

a. Make the stronger assumption that $y_{it} \mid (y_{i,t-1}, y_{i,t-2}, \ldots, y_{i0}, c_i)$ is normally distributed
with mean $\rho y_{i,t-1} + c_i$ and variance $\sigma_e^2$. Find the density of $(y_{i1}, \ldots, y_{iT})$
given $(y_{i0}, c_i)$. Is it a good idea to use the log of this density, summed across $i$, to
estimate $\rho$ and $\sigma_e^2$ along with the "fixed effects" $c_i$?

b. If $c_i \mid y_{i0} \sim \mathrm{Normal}(\alpha_0 + \alpha_1 y_{i0}, \sigma_a^2)$, where $\sigma_a^2 \equiv \mathrm{Var}(a_i)$ and $a_i \equiv c_i - \alpha_0 - \alpha_1 y_{i0}$,
write down the density of $(y_{i1}, \ldots, y_{iT})$ given $y_{i0}$. How would you estimate $\rho$, $\alpha_0$, $\alpha_1$,
$\sigma_e^2$, and $\sigma_a^2$?

c. Under the same assumptions in parts a and b, extend the model to $y_{it} = \rho y_{i,t-1} +
c_i + \delta c_i y_{i,t-1} + e_{it}$. Explain how to estimate the parameters of this model, and propose
a consistent estimator of the average partial effect of the lag, $\rho + \delta E(c_i)$.

d. Now extend part b to the case where $\mathbf{z}_{it}\boldsymbol{\beta}$ is added to the conditional mean function,
where the $\mathbf{z}_{it}$ are strictly exogenous conditional on $c_i$. Assume that $c_i \mid y_{i0}, \mathbf{z}_i \sim
\mathrm{Normal}(\alpha_0 + \alpha_1 y_{i0} + \bar{\mathbf{z}}_i\boldsymbol{\delta}, \sigma_a^2)$, where $\bar{\mathbf{z}}_i$ is the vector of time averages.


13.12. In the context of GLM, for a given LEF density, there are many ways to
characterize what is known as the canonical link function. One is that $g(\cdot)$ is the
canonical link if the first-order conditions for the QMLE can be expressed as

$$\sum_{i=1}^N \mathbf{x}_i'(y_i - \hat{\mu}_i) = \sum_{i=1}^N \mathbf{x}_i'[y_i - m(\mathbf{x}_i, \hat{\boldsymbol{\theta}})] = \mathbf{0},$$

where $m(\mathbf{x}, \boldsymbol{\theta})$ is the corresponding mean function and $\hat{\boldsymbol{\theta}}$ is the QMLE. Alternatively,
using the canonical link implies that the Hessian of the quasi-log likelihood (QLL)
does not depend on $y_i$; consequently, the Hessian and the expected value of the
Hessian given $\mathbf{x}_i$ are identical.

a. Define residuals as usual, $\hat{u}_i = y_i - m(\mathbf{x}_i, \hat{\boldsymbol{\theta}})$. Argue that if $\mathbf{x}_i$ contains a
constant, as it almost always does in GLM analysis, then the residuals obtained by
using the canonical link always average to zero. (This extends the well-known result
for linear regression.)

b. Show that for the Bernoulli QLL, the canonical link function is $g(\mu) =
\log[\mu/(1 - \mu)]$, so that the mean function is the logistic function, $\exp(\mathbf{x}\boldsymbol{\theta})/
[1 + \exp(\mathbf{x}\boldsymbol{\theta})]$.

c. Show that for the Poisson QLL, the canonical link is $g(\mu) = \log(\mu)$. What is the
mean function?

d. Evaluate the following statement: "In a GLM analysis using the canonical link,
our estimate of the asymptotic variance of $\hat{\boldsymbol{\theta}}$ is the same whether or not we assume
the mean function is correctly specified."


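13.13. Let $\boldsymbol{\theta}^*$ be the solution to the maximization problem

$$\max_{\boldsymbol{\theta} \in \Theta} E[\ell(\mathbf{w}_i, \boldsymbol{\theta})],$$

where $\ell(\mathbf{w}, \boldsymbol{\theta})$ is a quasi-log likelihood satisfying the conditions of Theorem 12.3 (but
with maximization rather than minimization). Assume that $\boldsymbol{\theta}^*$ is in the interior of $\Theta$
and that the gradients and expectation can be interchanged. Let $\hat{\boldsymbol{\theta}}$ be the quasi-MLE.
Show that

$$N^{-1/2} \sum_{i=1}^N \ell(\mathbf{w}_i, \hat{\boldsymbol{\theta}}) = N^{-1/2} \sum_{i=1}^N \ell(\mathbf{w}_i, \boldsymbol{\theta}^*) + o_p(1).$$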


Appendix 13A 


In this appendix we cover some important properties of conditional distributions
and conditional densities. Billingsley (1979) is a good reference for this material.
For random vectors $\mathbf{y} \in \mathcal{Y} \subset \mathbb{R}^G$ and $\mathbf{x} \in \mathcal{X} \subset \mathbb{R}^K$, the conditional distribution of $\mathbf{y}$
given $\mathbf{x}$ always exists and is denoted $D(\mathbf{y} \mid \mathbf{x})$. For each $\mathbf{x}$ this distribution is a probability
measure and completely describes the behavior of the random vector $\mathbf{y}$ once $\mathbf{x}$
takes on a particular value. In econometrics, we almost always assume that this
distribution is described by a conditional density, which we denote by $p(\cdot \mid \mathbf{x})$. The
density is with respect to a measure defined on the support $\mathcal{Y}$ of $\mathbf{y}$. A conditional density
makes sense only when this measure does not change with the value of $\mathbf{x}$. In
practice, this assumption is not very restrictive, as it means that the nature of $\mathbf{y}$ is
not dramatically different for different values of $\mathbf{x}$. Let $\nu$ be this measure on $\mathbb{R}^G$. If
$D(\mathbf{y} \mid \mathbf{x})$ is discrete, $\nu$ can be the counting measure and all integrals are sums. If
$D(\mathbf{y} \mid \mathbf{x})$ is absolutely continuous, then $\nu$ is the familiar Lebesgue measure appearing
in elementary integration theory. In some cases, $D(\mathbf{y} \mid \mathbf{x})$ has both discrete and continuous
characteristics.

The important point is that all conditional probabilities can be obtained by integration:

$$P(\mathbf{y} \in \mathcal{A} \mid \mathbf{x} = x) = \int_{\mathcal{A}} p(y \mid x)\,\nu(dy),$$

where $y$ is the dummy argument of integration. When $\mathbf{y}$ is discrete, taking on the
values $y_1, y_2, \ldots$, then $p(\cdot \mid x)$ is a probability mass function and $P(\mathbf{y} = y_j \mid \mathbf{x} = x) =
p(y_j \mid x)$, $j = 1, 2, \ldots$.

Suppose that $f$ and $g$ are nonnegative functions on $\mathbb{R}^M$, and define $\mathcal{S}_f \equiv
\{z \in \mathbb{R}^M : f(z) > 0\}$. Assume that

$$1 = \int_{\mathcal{S}_f} f(z)\,\nu(dz) \geq \int_{\mathcal{S}_f} g(z)\,\nu(dz), \tag{13.99}$$

where $\nu$ is a measure on $\mathbb{R}^M$. The equality in expression (13.99) implies that $f$ is a
density on $\mathbb{R}^M$, while the inequality holds if $g$ is also a density on $\mathbb{R}^M$. An important
result is that

$$\mathcal{I}(f; g) \equiv \int_{\mathcal{S}_f} \log[f(z)/g(z)]\, f(z)\,\nu(dz) \geq 0. \tag{13.100}$$

(Note that $\mathcal{I}(f; g) = \infty$ is allowed; one case where this result can occur is $f(z) > 0$
but $g(z) = 0$ for some $z$. Also, the integrand is not defined when $f(z) = g(z) = 0$, but
such values of $z$ have no effect because the integrand receives zero weight in the integration.)
The quantity $\mathcal{I}(f; g)$ is called the Kullback-Leibler information criterion
(KLIC). Another way to state expression (13.100) is

$$E\{\log[f(\mathbf{z})]\} \geq E\{\log[g(\mathbf{z})]\}, \tag{13.101}$$

where $\mathbf{z} \in \mathcal{Z} \subset \mathbb{R}^M$ is a random vector with density $f$.
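
For a discrete distribution with counting measure, the KLIC is a finite sum and the inequality can be checked directly. The snippet below is a small numerical sketch; the function name `klic` and the example probabilities are chosen here for illustration only.

```python
import numpy as np

def klic(f, g):
    """Kullback-Leibler information criterion: sum over z of f(z) * log[f(z)/g(z)].

    f and g are arrays of probabilities on a common finite support.
    Points with f(z) = 0 receive zero weight; f(z) > 0 with g(z) = 0 gives infinity.
    """
    f = np.asarray(f, dtype=float)
    g = np.asarray(g, dtype=float)
    support = f > 0
    if np.any(g[support] == 0):
        return np.inf
    return float(np.sum(f[support] * np.log(f[support] / g[support])))

f = np.array([0.5, 0.3, 0.2])
g = np.array([0.4, 0.4, 0.2])
print(klic(f, f))                            # zero when g coincides with f
print(klic(f, g))                            # strictly positive
print(klic(f, np.array([0.5, 0.5, 0.0])))    # infinite: g puts no mass where f does
```

The first two calls illustrate why the true density maximizes the expected log density: any other candidate $g$ can only lower $E\{\log[g(\mathbf{z})]\}$.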
Conditional MLE relies on a conditional version of inequality (13.99): 


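PROPERTY CD.1: Let $\mathbf{y} \in \mathcal{Y} \subset \mathbb{R}^G$ and $\mathbf{x} \in \mathcal{X} \subset \mathbb{R}^K$ be random vectors. Let $p(\cdot \mid \cdot)$
denote the conditional density of $\mathbf{y}$ given $\mathbf{x}$. For each $x$, let $\mathcal{Y}(x) \equiv \{y : p(y \mid x) > 0\}$
be the conditional support of $\mathbf{y}$, and let $\nu$ be a measure that does not depend on $x$.
Then, for any other function $g(\cdot \mid x) \geq 0$ such that

$$1 = \int_{\mathcal{Y}(x)} p(y \mid x)\,\nu(dy) \geq \int_{\mathcal{Y}(x)} g(y \mid x)\,\nu(dy),$$

the conditional KLIC is nonnegative:

$$\mathcal{I}_x(p; g) \equiv \int_{\mathcal{Y}(x)} \log[p(y \mid x)/g(y \mid x)]\, p(y \mid x)\,\nu(dy) \geq 0.$$

That is,

$$E\{\log[p(\mathbf{y} \mid \mathbf{x})] \mid \mathbf{x}\} \geq E\{\log[g(\mathbf{y} \mid \mathbf{x})] \mid \mathbf{x}\}$$

for any $\mathbf{x} \in \mathcal{X}$. The proof uses the conditional Jensen's inequality (Property CE.7 in
Chapter 2). See Manski (1988, Sect. 5.1).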


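PROPERTY CD.2: For random vectors $\mathbf{y}$, $\mathbf{x}$, and $\mathbf{z}$, let $p(y \mid x, z)$ be the conditional
density of $\mathbf{y}$ given $(\mathbf{x}, \mathbf{z})$ and let $p(x \mid z)$ denote the conditional density of $\mathbf{x}$ given $\mathbf{z}$.
Then the density of $(\mathbf{y}, \mathbf{x})$ given $\mathbf{z}$ is

$$p(y, x \mid z) = p(y \mid x, z)\,p(x \mid z),$$

where the non-bold variables are placeholders.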


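PROPERTY CD.3: For random vectors $\mathbf{y}$, $\mathbf{x}$, and $\mathbf{z}$, let $p(y \mid x, z)$ be the conditional
density of $\mathbf{y}$ given $(\mathbf{x}, \mathbf{z})$, let $p(y \mid x)$ be the conditional density of $\mathbf{y}$ given $\mathbf{x}$, and let
$p(z \mid x)$ denote the conditional density of $\mathbf{z}$ given $\mathbf{x}$ with respect to the measure $\nu(dz)$.
Then

$$p(y \mid x) = \int_{\mathcal{Z}} p(y \mid x, z)\,p(z \mid x)\,\nu(dz).$$

In other words, we can obtain the density of $\mathbf{y}$ given $\mathbf{x}$ by integrating the density of $\mathbf{y}$
given the larger conditioning set, $(\mathbf{x}, \mathbf{z})$, against the density of $\mathbf{z}$ given $\mathbf{x}$.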


PROPERTY CD.4: Suppose that the random variable $u$, with cdf $F$, is independent of
the random vector $\mathbf{x}$. Then, for any function $a(\mathbf{x})$ of $\mathbf{x}$, $P[u \leq a(\mathbf{x}) \mid \mathbf{x}] = F[a(\mathbf{x})]$.


14 Generalized Method of Moments and Minimum Distance Estimation


In Chapter 8 we saw how the generalized method of moments (GMM) approach to 
estimation can be applied to multiple-equation linear models, including systems of 
equations, with exogenous or endogenous explanatory variables, and to panel data 
models. In this chapter we extend GMM to nonlinear estimation problems. This 
setup allows us to treat various efficiency issues that we have glossed over until now. 
We also cover the related method of minimum distance estimation. Because the 
asymptotic analysis has many features in common with Chapters 8 and 12, the anal- 
ysis is not quite as detailed here as in previous chapters. A good reference for this 
material, which fills in most of the gaps left here, is Newey and McFadden (1994). 


14.1 Asymptotic Properties of Generalized Method of Moments 


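Let $\{\mathbf{w}_i \in \mathbb{R}^M : i = 1, 2, \ldots\}$ denote a set of independent, identically distributed random
vectors, where some feature of the distribution of $\mathbf{w}_i$ is indexed by the $P \times 1$
parameter vector $\boldsymbol{\theta}$. The assumption of identical distribution is mostly for notational
convenience; the following methods apply to independently pooled cross sections
without modification.

We assume that for some function $\mathbf{g}(\mathbf{w}_i, \boldsymbol{\theta}) \in \mathbb{R}^L$, the parameter $\boldsymbol{\theta}_o \in \Theta \subset \mathbb{R}^P$ satisfies
the moment assumptions

$$E[\mathbf{g}(\mathbf{w}_i, \boldsymbol{\theta}_o)] = \mathbf{0}. \tag{14.1}$$

As we saw in the linear case, where $\mathbf{g}(\mathbf{w}_i, \boldsymbol{\theta})$ was of the form $\mathbf{Z}_i'(\mathbf{y}_i - \mathbf{X}_i\boldsymbol{\theta})$, a minimal
requirement for these moment conditions to identify $\boldsymbol{\theta}_o$ is $L \geq P$. If $L = P$, then
the analogy principle suggests estimating $\boldsymbol{\theta}_o$ by setting the sample counterpart,
$N^{-1}\sum_{i=1}^N \mathbf{g}(\mathbf{w}_i, \boldsymbol{\theta})$, to zero. In the linear case, this step leads to the instrumental variables
estimator (see equation (8.22)). When $L > P$, we can choose $\hat{\boldsymbol{\theta}}$ to make the
sample average close to zero in an appropriate metric. A GMM estimator, $\hat{\boldsymbol{\theta}}$, minimizes
a quadratic form in $\sum_{i=1}^N \mathbf{g}(\mathbf{w}_i, \boldsymbol{\theta})$:

$$\min_{\boldsymbol{\theta} \in \Theta} \left[\sum_{i=1}^N \mathbf{g}(\mathbf{w}_i, \boldsymbol{\theta})\right]' \hat{\boldsymbol{\Xi}} \left[\sum_{i=1}^N \mathbf{g}(\mathbf{w}_i, \boldsymbol{\theta})\right], \tag{14.2}$$

where $\hat{\boldsymbol{\Xi}}$ is an $L \times L$ symmetric, positive semidefinite (p.s.d.) weighting matrix.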

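Consistency of the GMM estimator follows along the lines of consistency of the
M-estimator in Chapter 12. Under standard moment conditions, $N^{-1}\sum_{i=1}^N \mathbf{g}(\mathbf{w}_i, \boldsymbol{\theta})$
satisfies the uniform law of large numbers (see Theorem 12.1). If $\hat{\boldsymbol{\Xi}} \xrightarrow{p} \boldsymbol{\Xi}_o$, where $\boldsymbol{\Xi}_o$ is
an $L \times L$ positive definite matrix, then the random function

$$Q_N(\boldsymbol{\theta}) \equiv \left[N^{-1}\sum_{i=1}^N \mathbf{g}(\mathbf{w}_i, \boldsymbol{\theta})\right]' \hat{\boldsymbol{\Xi}} \left[N^{-1}\sum_{i=1}^N \mathbf{g}(\mathbf{w}_i, \boldsymbol{\theta})\right] \tag{14.3}$$

converges uniformly in probability to

$$\{E[\mathbf{g}(\mathbf{w}_i, \boldsymbol{\theta})]\}' \boldsymbol{\Xi}_o \{E[\mathbf{g}(\mathbf{w}_i, \boldsymbol{\theta})]\}. \tag{14.4}$$

Because $\boldsymbol{\Xi}_o$ is positive definite, $\boldsymbol{\theta}_o$ uniquely minimizes expression (14.4). For completeness,
we summarize with a theorem containing regularity conditions:

THEOREM 14.1 (Consistency of GMM): Assume that (a) $\Theta$ is compact; (b) for each
$\boldsymbol{\theta} \in \Theta$, $\mathbf{g}(\cdot, \boldsymbol{\theta})$ is Borel measurable on $\mathcal{W}$; (c) for each $\mathbf{w} \in \mathcal{W}$, $\mathbf{g}(\mathbf{w}, \cdot)$ is continuous on
$\Theta$; (d) $|g_j(\mathbf{w}, \boldsymbol{\theta})| \leq b(\mathbf{w})$ for all $\boldsymbol{\theta} \in \Theta$ and $j = 1, \ldots, L$, where $b(\cdot)$ is a nonnegative
function on $\mathcal{W}$ such that $E[b(\mathbf{w})] < \infty$; (e) $\hat{\boldsymbol{\Xi}} \xrightarrow{p} \boldsymbol{\Xi}_o$, an $L \times L$ positive definite matrix;
and (f) $\boldsymbol{\theta}_o$ is the unique solution to equation (14.1). Then a random vector $\hat{\boldsymbol{\theta}}$ exists
that solves problem (14.2), and $\hat{\boldsymbol{\theta}} \xrightarrow{p} \boldsymbol{\theta}_o$.

If we assume only that $\boldsymbol{\Xi}_o$ is p.s.d., then we must directly assume that $\boldsymbol{\theta}_o$ is the unique
minimizer of expression (14.4). Occasionally this generality is useful, but we will not
need it.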

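Under the assumption that $\mathbf{g}(\mathbf{w}, \cdot)$ is continuously differentiable on $\mathrm{int}(\Theta)$, $\boldsymbol{\theta}_o \in
\mathrm{int}(\Theta)$, and other standard regularity conditions, we can easily derive the limiting
distribution of the GMM estimator. The first-order condition for $\hat{\boldsymbol{\theta}}$ can be written as

$$\left[\sum_{i=1}^N \nabla_\theta \mathbf{g}(\mathbf{w}_i, \hat{\boldsymbol{\theta}})\right]' \hat{\boldsymbol{\Xi}} \left[\sum_{i=1}^N \mathbf{g}(\mathbf{w}_i, \hat{\boldsymbol{\theta}})\right] \equiv \mathbf{0}. \tag{14.5}$$

Define the $L \times P$ matrix

$$\mathbf{G}_o \equiv E[\nabla_\theta \mathbf{g}(\mathbf{w}_i, \boldsymbol{\theta}_o)], \tag{14.6}$$

which we assume to have full rank $P$. This assumption essentially means that the
moment conditions (14.1) are nonredundant. Then, by the weak law of large numbers
(WLLN) and central limit theorem (CLT),

$$N^{-1}\sum_{i=1}^N \nabla_\theta \mathbf{g}(\mathbf{w}_i, \boldsymbol{\theta}_o) \xrightarrow{p} \mathbf{G}_o \quad \text{and} \quad N^{-1/2}\sum_{i=1}^N \mathbf{g}(\mathbf{w}_i, \boldsymbol{\theta}_o) = O_p(1), \tag{14.7}$$

respectively. Let $\mathbf{g}_i(\boldsymbol{\theta}) \equiv \mathbf{g}(\mathbf{w}_i, \boldsymbol{\theta})$. A mean value expansion of $\sum_{i=1}^N \mathbf{g}(\mathbf{w}_i, \hat{\boldsymbol{\theta}})$ about $\boldsymbol{\theta}_o$,
appropriate standardizations by the sample size, and replacing random averages with
their plims gives

$$\mathbf{0} = \mathbf{G}_o'\boldsymbol{\Xi}_o N^{-1/2}\sum_{i=1}^N \mathbf{g}_i(\boldsymbol{\theta}_o) + \mathbf{A}_o\sqrt{N}(\hat{\boldsymbol{\theta}} - \boldsymbol{\theta}_o) + o_p(1), \tag{14.8}$$

where

$$\mathbf{A}_o \equiv \mathbf{G}_o'\boldsymbol{\Xi}_o\mathbf{G}_o. \tag{14.9}$$

Since $\mathbf{A}_o$ is positive definite under the given assumptions, we have

$$\sqrt{N}(\hat{\boldsymbol{\theta}} - \boldsymbol{\theta}_o) = -\mathbf{A}_o^{-1}\mathbf{G}_o'\boldsymbol{\Xi}_o N^{-1/2}\sum_{i=1}^N \mathbf{g}_i(\boldsymbol{\theta}_o) + o_p(1) \xrightarrow{d} \mathrm{Normal}(\mathbf{0}, \mathbf{A}_o^{-1}\mathbf{B}_o\mathbf{A}_o^{-1}), \tag{14.10}$$

where

$$\mathbf{B}_o \equiv \mathbf{G}_o'\boldsymbol{\Xi}_o\boldsymbol{\Lambda}_o\boldsymbol{\Xi}_o\mathbf{G}_o \tag{14.11}$$

and

$$\boldsymbol{\Lambda}_o \equiv E[\mathbf{g}_i(\boldsymbol{\theta}_o)\mathbf{g}_i(\boldsymbol{\theta}_o)'] = \mathrm{Var}[\mathbf{g}_i(\boldsymbol{\theta}_o)]. \tag{14.12}$$

Expression (14.10) gives the influence function representation for the GMM estimator,
and it also gives the limiting distribution of the GMM estimator. We summarize
with a theorem, which is essentially given by Newey and McFadden (1994, Theorem
3.4):

THEOREM 14.2 (Asymptotic Normality of GMM): In addition to the assumptions in
Theorem 14.1, assume that (a) $\boldsymbol{\theta}_o$ is in the interior of $\Theta$; (b) $\mathbf{g}(\mathbf{w}, \cdot)$ is continuously
differentiable on the interior of $\Theta$ for all $\mathbf{w} \in \mathcal{W}$; (c) each element of $\mathbf{g}(\mathbf{w}, \boldsymbol{\theta}_o)$ has finite
second moment; (d) each element of $\nabla_\theta \mathbf{g}(\mathbf{w}, \boldsymbol{\theta})$ is bounded in absolute value by a
function $b(\mathbf{w})$, where $E[b(\mathbf{w})] < \infty$; and (e) $\mathbf{G}_o$ in expression (14.6) has rank $P$. Then
expression (14.10) holds, and so $\mathrm{Avar}(\hat{\boldsymbol{\theta}}) = \mathbf{A}_o^{-1}\mathbf{B}_o\mathbf{A}_o^{-1}/N$.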


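Estimating the asymptotic variance of the GMM estimator is easy once $\hat{\boldsymbol{\theta}}$ has been
obtained. A consistent estimator of $\boldsymbol{\Lambda}_o$ is given by

$$\hat{\boldsymbol{\Lambda}} \equiv N^{-1}\sum_{i=1}^N \mathbf{g}_i(\hat{\boldsymbol{\theta}})\mathbf{g}_i(\hat{\boldsymbol{\theta}})', \tag{14.13}$$

and $\mathrm{Avar}(\hat{\boldsymbol{\theta}})$ is estimated as $\hat{\mathbf{A}}^{-1}\hat{\mathbf{B}}\hat{\mathbf{A}}^{-1}/N$, where

$$\hat{\mathbf{A}} \equiv \hat{\mathbf{G}}'\hat{\boldsymbol{\Xi}}\hat{\mathbf{G}}, \qquad \hat{\mathbf{B}} \equiv \hat{\mathbf{G}}'\hat{\boldsymbol{\Xi}}\hat{\boldsymbol{\Lambda}}\hat{\boldsymbol{\Xi}}\hat{\mathbf{G}}, \tag{14.14}$$

and

$$\hat{\mathbf{G}} \equiv N^{-1}\sum_{i=1}^N \nabla_\theta \mathbf{g}_i(\hat{\boldsymbol{\theta}}). \tag{14.15}$$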

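As in the linear case in Section 8.3.3, an optimal weighting matrix exists for the
given moment conditions: $\hat{\boldsymbol{\Xi}}$ should be a consistent estimator of $\boldsymbol{\Lambda}_o^{-1}$. When $\boldsymbol{\Xi}_o = \boldsymbol{\Lambda}_o^{-1}$,
$\mathbf{B}_o = \mathbf{A}_o$ and $\mathrm{Avar}\,\sqrt{N}(\hat{\boldsymbol{\theta}} - \boldsymbol{\theta}_o) = (\mathbf{G}_o'\boldsymbol{\Lambda}_o^{-1}\mathbf{G}_o)^{-1}$. Thus the difference in asymptotic
variances between the general GMM estimator and the estimator with $\mathrm{plim}\,\hat{\boldsymbol{\Xi}} =
\boldsymbol{\Lambda}_o^{-1}$ is

$$(\mathbf{G}_o'\boldsymbol{\Xi}_o\mathbf{G}_o)^{-1}(\mathbf{G}_o'\boldsymbol{\Xi}_o\boldsymbol{\Lambda}_o\boldsymbol{\Xi}_o\mathbf{G}_o)(\mathbf{G}_o'\boldsymbol{\Xi}_o\mathbf{G}_o)^{-1} - (\mathbf{G}_o'\boldsymbol{\Lambda}_o^{-1}\mathbf{G}_o)^{-1}. \tag{14.16}$$

This expression can be shown to be p.s.d. using the same argument as in Chapter 8
(see Problem 8.5).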
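
The sandwich formula and the efficiency ranking in (14.16) can be checked numerically. The sketch below uses randomly generated placeholder matrices in place of $\hat{\mathbf{G}}$ and $\hat{\boldsymbol{\Lambda}}$; the function name `gmm_avar` is hypothetical and the numbers carry no empirical meaning.

```python
import numpy as np

def gmm_avar(G, Xi, Lam, N):
    """Estimated Avar of theta-hat: A^{-1} B A^{-1} / N, with
    A = G' Xi G and B = G' Xi Lam Xi G, as in (14.14)."""
    A = G.T @ Xi @ G
    B = G.T @ Xi @ Lam @ Xi @ G
    A_inv = np.linalg.inv(A)
    return A_inv @ B @ A_inv / N

rng = np.random.default_rng(1)
L, P, N = 4, 2, 1000
G = rng.normal(size=(L, P))       # placeholder for G-hat (L x P, full column rank)
M = rng.normal(size=(L, L))
Lam = M @ M.T + L * np.eye(L)     # placeholder for Lambda-hat (positive definite)

V_identity = gmm_avar(G, np.eye(L), Lam, N)           # arbitrary weighting matrix
V_optimal = gmm_avar(G, np.linalg.inv(Lam), Lam, N)   # Xi = Lambda^{-1}
V_formula = np.linalg.inv(G.T @ np.linalg.inv(Lam) @ G) / N

# Eigenvalues of the difference in (14.16), scaled by 1/N, should be nonnegative.
eigs = np.linalg.eigvalsh(V_identity - V_optimal)
print(eigs)
```

With the optimal weighting matrix the sandwich collapses to $(\mathbf{G}'\boldsymbol{\Lambda}^{-1}\mathbf{G})^{-1}/N$, which is why `V_optimal` and `V_formula` agree up to rounding.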

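To obtain an asymptotically efficient GMM estimator we need a preliminary estimator
of $\boldsymbol{\theta}_o$ in order to obtain $\hat{\boldsymbol{\Lambda}}$. Let $\hat{\hat{\boldsymbol{\theta}}}$ be such an estimator, and define $\hat{\boldsymbol{\Lambda}}$ as in
expression (14.13) but with $\hat{\hat{\boldsymbol{\theta}}}$ in place of $\hat{\boldsymbol{\theta}}$. Then, an efficient GMM estimator (given
the function $\mathbf{g}(\mathbf{w}, \boldsymbol{\theta})$) solves

$$\min_{\boldsymbol{\theta} \in \Theta} \left[\sum_{i=1}^N \mathbf{g}(\mathbf{w}_i, \boldsymbol{\theta})\right]' \hat{\boldsymbol{\Lambda}}^{-1} \left[\sum_{i=1}^N \mathbf{g}(\mathbf{w}_i, \boldsymbol{\theta})\right], \tag{14.17}$$

and its asymptotic variance is estimated as

$$\widehat{\mathrm{Avar}}(\hat{\boldsymbol{\theta}}) = (\hat{\mathbf{G}}'\hat{\boldsymbol{\Lambda}}^{-1}\hat{\mathbf{G}})^{-1}/N. \tag{14.18}$$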


As in the linear case, an optimal GMM estimator is called the minimum chi-square
estimator because

$$\left[N^{-1/2}\sum_{i=1}^N \mathbf{g}_i(\hat{\boldsymbol{\theta}})\right]' \hat{\boldsymbol{\Lambda}}^{-1} \left[N^{-1/2}\sum_{i=1}^N \mathbf{g}_i(\hat{\boldsymbol{\theta}})\right] \tag{14.19}$$

has a limiting chi-square distribution with $L - P$ degrees of freedom under the conditions
of Theorem 14.2. Therefore, the value of the objective function (properly
standardized by the sample size) can be used as a test of any overidentifying restrictions
in equation (14.1) when $L > P$. If statistic (14.19) exceeds the relevant critical
value in a $\chi^2_{L-P}$ distribution, then equation (14.1) must be rejected: at least some of
the moment conditions are not supported by the data. For the linear model, this is
the same statistic given in equation (8.49).
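
A compact illustration of the two-step procedure and the overidentification statistic is possible in the linear case, where the GMM minimizer has a closed form. The sketch below assumes a single-equation model with moment function $\mathbf{z}_i'(y_i - \mathbf{x}_i\boldsymbol{\theta})$, simulated data, and a first-step weighting matrix $(N^{-1}\sum_i \mathbf{z}_i'\mathbf{z}_i)^{-1}$; the function names are hypothetical, and a nonlinear $\mathbf{g}$ would require a numerical optimizer instead of the closed form.

```python
import numpy as np

def linear_gmm(y, X, Z, W):
    """Minimize [Z'(y - X theta)]' W [Z'(y - X theta)]; closed form in the linear case."""
    XZ = X.T @ Z
    return np.linalg.solve(XZ @ W @ Z.T @ X, XZ @ W @ Z.T @ y)

def two_step_gmm(y, X, Z):
    N = len(y)
    # Step 1: preliminary estimator with weighting matrix (Z'Z/N)^{-1}.
    theta_prelim = linear_gmm(y, X, Z, np.linalg.inv(Z.T @ Z / N))
    # Step 2: Lambda-hat = N^{-1} sum_i g_i g_i' evaluated at the preliminary estimate.
    g_prelim = Z * (y - X @ theta_prelim)[:, None]   # N x L; row i is g_i'
    Lam_inv = np.linalg.inv(g_prelim.T @ g_prelim / N)
    theta_hat = linear_gmm(y, X, Z, Lam_inv)
    # Estimated asymptotic variance as in (14.18); here G-hat = -Z'X/N.
    G = -(Z.T @ X) / N
    avar = np.linalg.inv(G.T @ Lam_inv @ G) / N
    # Overidentification statistic as in (14.19).
    g_sum = Z.T @ (y - X @ theta_hat) / np.sqrt(N)
    J = float(g_sum @ Lam_inv @ g_sum)
    return theta_hat, avar, J

# Simulated data: one endogenous regressor, three valid instruments (L = 4, P = 2).
rng = np.random.default_rng(2)
N = 5000
z = rng.normal(size=(N, 3))
v = rng.normal(size=N)
u = 0.5 * v + rng.normal(size=N)          # error correlated with x through v
x = z @ np.array([1.0, 0.5, 0.5]) + v
y = 1.0 + 2.0 * x + u
X = np.column_stack([np.ones(N), x])
Z = np.column_stack([np.ones(N), z])
theta_hat, avar, J = two_step_gmm(y, X, Z)
print(theta_hat, np.sqrt(np.diag(avar)), J)   # J is compared with chi-square, L - P = 2 df
```

Because the simulated instruments are valid, `J` should typically fall below conventional critical values of the $\chi^2_2$ distribution; a large value would signal that some moment conditions fail.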

As always, we can test hypotheses of the form $H_0: \mathbf{c}(\boldsymbol{\theta}_o) = \mathbf{0}$, where $\mathbf{c}(\boldsymbol{\theta})$ is a
$Q \times 1$ vector, $Q \leq P$, by using the Wald approach and the appropriate variance matrix
estimator. A statistic based on the difference in objective functions is also available
if the minimum chi-square estimator is used so that $\mathbf{B}_o = \mathbf{A}_o$. Let $\tilde{\boldsymbol{\theta}}$ denote the
solution to problem (14.17) subject to the restrictions $\mathbf{c}(\boldsymbol{\theta}) = \mathbf{0}$, and let $\hat{\boldsymbol{\theta}}$ denote the
unrestricted estimator solving problem (14.17); importantly, these both use the same
weighting matrix $\hat{\boldsymbol{\Lambda}}^{-1}$. Typically, $\hat{\boldsymbol{\Lambda}}$ is obtained from a first-stage, unrestricted estimator.
Assuming that the constraints can be written in implicit form and satisfy the
conditions discussed in Section 12.6.2, the GMM distance statistic (or GMM criterion
function statistic) has a limiting $\chi^2_Q$ distribution:

$$\left\{\left[\sum_{i=1}^N \mathbf{g}_i(\tilde{\boldsymbol{\theta}})\right]' \hat{\boldsymbol{\Lambda}}^{-1} \left[\sum_{i=1}^N \mathbf{g}_i(\tilde{\boldsymbol{\theta}})\right] - \left[\sum_{i=1}^N \mathbf{g}_i(\hat{\boldsymbol{\theta}})\right]' \hat{\boldsymbol{\Lambda}}^{-1} \left[\sum_{i=1}^N \mathbf{g}_i(\hat{\boldsymbol{\theta}})\right]\right\} \Big/ N. \tag{14.20}$$

When applied to linear GMM problems, we obtain the statistic in equation (8.45).
One nice feature of expression (14.20) is that it is invariant to reparameterization of
the null hypothesis, just as the quasi-likelihood ratio (QLR) statistic is invariant for
M-estimation. Therefore, we might prefer statistic (14.20) over the Wald statistic
(8.48) for testing nonlinear restrictions in linear models. Of course, the computation
of expression (14.20) is more difficult because we would actually need to carry out
estimation subject to nonlinear restrictions.

A nice application of the GMM methods discussed in this section is two-step estimation
procedures, which arose in Chapters 6, 12, and 13. Suppose that the estimator
$\hat{\boldsymbol{\theta}}$ (it could be an M-estimator or a GMM estimator) depends on a first-stage estimator,
$\hat{\boldsymbol{\gamma}}$. A unified approach to obtaining the asymptotic variance of $\hat{\boldsymbol{\theta}}$ is to stack the
first-order conditions for $\hat{\boldsymbol{\theta}}$ and $\hat{\boldsymbol{\gamma}}$ into the same function $\mathbf{g}(\cdot)$. This is always possible
for the estimators encountered in this book. For example, if $\hat{\boldsymbol{\gamma}}$ is an M-estimator
solving $\sum_{i=1}^N \mathbf{d}(\mathbf{w}_i, \hat{\boldsymbol{\gamma}}) = \mathbf{0}$, and $\hat{\boldsymbol{\theta}}$ is a two-step M-estimator solving

$$\sum_{i=1}^N \mathbf{s}(\mathbf{w}_i, \hat{\boldsymbol{\theta}}; \hat{\boldsymbol{\gamma}}) = \mathbf{0}, \tag{14.21}$$

then we can obtain the asymptotic variance of $\hat{\boldsymbol{\theta}}$ by defining

$$\mathbf{g}(\mathbf{w}, \boldsymbol{\theta}, \boldsymbol{\gamma}) = \begin{bmatrix} \mathbf{s}(\mathbf{w}, \boldsymbol{\theta}; \boldsymbol{\gamma}) \\ \mathbf{d}(\mathbf{w}, \boldsymbol{\gamma}) \end{bmatrix}$$

and applying the GMM formulas. The first-order condition for the full GMM problem
reproduces the first-order conditions for each estimator separately.

In general, either $\hat{\boldsymbol{\gamma}}$, $\hat{\boldsymbol{\theta}}$, or both might themselves be GMM estimators. Then,
stacking the orthogonality conditions into one vector can simplify the derivation of
the asymptotic variance of the second-step estimator $\hat{\boldsymbol{\theta}}$ while also ensuring efficient
estimation when the optimal weighting matrix is used.
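
To make the stacking argument concrete, the following sketch works out the exactly identified case, in which $\mathbf{s}$ has the same dimension as $\boldsymbol{\theta}$ and $\mathbf{d}$ has the same dimension as $\boldsymbol{\gamma}$. The block labels $\mathbf{A}_s$, $\mathbf{F}_o$, and $\mathbf{D}_o$ are introduced here only for illustration; $\mathbf{s}_i$ and $\mathbf{d}_i$ denote the functions evaluated at the true values, and all three blocks are assumed to exist with $\mathbf{A}_s$ and $\mathbf{D}_o$ nonsingular.

```latex
% Jacobian of the stacked moment function with respect to (theta', gamma')':
\mathbf{G}_o =
\begin{bmatrix}
  \mathbf{A}_s & \mathbf{F}_o \\
  \mathbf{0}   & \mathbf{D}_o
\end{bmatrix},
\qquad
\mathbf{A}_s \equiv E[\nabla_\theta \mathbf{s}_i],\quad
\mathbf{F}_o \equiv E[\nabla_\gamma \mathbf{s}_i],\quad
\mathbf{D}_o \equiv E[\nabla_\gamma \mathbf{d}_i].

% G_o is square, so the weighting matrix drops out of (14.10) and
\mathbf{G}_o^{-1} =
\begin{bmatrix}
  \mathbf{A}_s^{-1} & -\mathbf{A}_s^{-1}\mathbf{F}_o\mathbf{D}_o^{-1} \\
  \mathbf{0}        & \mathbf{D}_o^{-1}
\end{bmatrix}.

% The upper block of -G_o^{-1} N^{-1/2} \sum_i g_i gives the influence function of theta-hat:
\sqrt{N}(\hat{\boldsymbol{\theta}} - \boldsymbol{\theta}_o)
  = -\mathbf{A}_s^{-1} N^{-1/2} \sum_{i=1}^N
      \bigl[\mathbf{s}_i - \mathbf{F}_o\mathbf{D}_o^{-1}\mathbf{d}_i\bigr] + o_p(1).
```

The second term in brackets is the adjustment for first-stage estimation; when $\mathbf{F}_o = \mathbf{0}$, it vanishes and the estimation of $\boldsymbol{\gamma}_o$ can be ignored asymptotically.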

Finally, sometimes we want to know whether adding moment conditions fails to
improve the efficiency of the minimum chi-square estimator. (Adding
additional moment conditions can never reduce asymptotic efficiency, provided an
efficient weighting matrix is used.) In other words, if we start with equation (14.1) but
add new moments of the form $E[\mathbf{h}(\mathbf{w}, \boldsymbol{\theta}_o)] = \mathbf{0}$, when does using the extra moment
conditions yield the same asymptotic variance as the original moment conditions?
Breusch, Qian, Schmidt, and Wyhowski (1999) prove some general redundancy
results for the minimum chi-square estimator. Qian and Schmidt (1999) study the
problem of adding moment conditions that do not depend on unknown parameters,
and they characterize when such moment conditions improve efficiency.


14.2 Estimation under Orthogonality Conditions 


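In Chapter 8 we saw how linear systems of equations can be estimated by GMM
under certain orthogonality conditions. In general applications, the moment conditions
(14.1) almost always arise from assumptions that disturbances are uncorrelated
with exogenous variables. For a $G \times 1$ vector $\mathbf{r}(\mathbf{w}_i, \boldsymbol{\theta})$ and a $G \times L$ matrix $\mathbf{Z}_i$,
assume that $\boldsymbol{\theta}_o$ satisfies

$$E[\mathbf{Z}_i'\mathbf{r}(\mathbf{w}_i, \boldsymbol{\theta}_o)] = \mathbf{0}. \tag{14.22}$$

The vector function $\mathbf{r}(\mathbf{w}_i, \boldsymbol{\theta})$ can be thought of as a generalized residual function. The
matrix $\mathbf{Z}_i$ is usually called the matrix of instruments. Equation (14.22) is a special case
of equation (14.1) with $\mathbf{g}(\mathbf{w}_i, \boldsymbol{\theta}) \equiv \mathbf{Z}_i'\mathbf{r}(\mathbf{w}_i, \boldsymbol{\theta})$. In what follows, write $\mathbf{r}_i(\boldsymbol{\theta}) \equiv \mathbf{r}(\mathbf{w}_i, \boldsymbol{\theta})$.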

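Identification requires that $\boldsymbol{\theta}_o$ be the only $\boldsymbol{\theta} \in \Theta$ such that equation (14.22) holds.
Condition e of the asymptotic normality result Theorem 14.2 requires that rank
$E[\mathbf{Z}_i'\nabla_\theta \mathbf{r}_i(\boldsymbol{\theta}_o)] = P$ (necessary is $L \geq P$). Thus, while $\mathbf{Z}_i$ must be orthogonal to $\mathbf{r}_i(\boldsymbol{\theta}_o)$,
$\mathbf{Z}_i$ must be sufficiently correlated with the $G \times P$ Jacobian, $\nabla_\theta \mathbf{r}_i(\boldsymbol{\theta}_o)$. In the linear case
where $\mathbf{r}(\mathbf{w}_i, \boldsymbol{\theta}) = \mathbf{y}_i - \mathbf{X}_i\boldsymbol{\theta}$, this requirement reduces to $E(\mathbf{Z}_i'\mathbf{X}_i)$ having full column
rank, which is simply Assumption SIV.2 in Chapter 8.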

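Given the instruments $\mathbf{Z}_i$, the efficient estimator can be obtained as in Section 14.1.
A preliminary estimator $\hat{\hat{\boldsymbol{\theta}}}$ is usually obtained with

$$\hat{\boldsymbol{\Xi}} \equiv \left(N^{-1}\sum_{i=1}^N \mathbf{Z}_i'\mathbf{Z}_i\right)^{-1} \tag{14.23}$$

so that $\hat{\hat{\boldsymbol{\theta}}}$ solves

$$\min_{\boldsymbol{\theta} \in \Theta} \left[\sum_{i=1}^N \mathbf{Z}_i'\mathbf{r}_i(\boldsymbol{\theta})\right]' \left[N^{-1}\sum_{i=1}^N \mathbf{Z}_i'\mathbf{Z}_i\right]^{-1} \left[\sum_{i=1}^N \mathbf{Z}_i'\mathbf{r}_i(\boldsymbol{\theta})\right]. \tag{14.24}$$

The solution to problem (14.24) is called the nonlinear system 2SLS estimator; it is an
example of a nonlinear instrumental variables estimator.

From Section 14.1, we know that the nonlinear system 2SLS estimator is guaranteed
to be the efficient GMM estimator if for some $\sigma_o^2 > 0$,

$$E[\mathbf{Z}_i'\mathbf{r}_i(\boldsymbol{\theta}_o)\mathbf{r}_i(\boldsymbol{\theta}_o)'\mathbf{Z}_i] = \sigma_o^2 E(\mathbf{Z}_i'\mathbf{Z}_i).$$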


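Generally, this is a strong assumption. Instead, we can obtain the minimum chi-square
estimator by obtaining

$$\hat{\boldsymbol{\Lambda}} = N^{-1}\sum_{i=1}^N \mathbf{Z}_i'\mathbf{r}_i(\hat{\hat{\boldsymbol{\theta}}})\mathbf{r}_i(\hat{\hat{\boldsymbol{\theta}}})'\mathbf{Z}_i \tag{14.25}$$

and using this in expression (14.17).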
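In some cases more structure is available that leads to a three-stage least squares
estimator. In particular, suppose that

$$E[\mathbf{Z}_i'\mathbf{r}_i(\boldsymbol{\theta}_o)\mathbf{r}_i(\boldsymbol{\theta}_o)'\mathbf{Z}_i] = E(\mathbf{Z}_i'\boldsymbol{\Omega}_o\mathbf{Z}_i), \tag{14.26}$$

where $\boldsymbol{\Omega}_o$ is the $G \times G$ matrix

$$\boldsymbol{\Omega}_o = E[\mathbf{r}_i(\boldsymbol{\theta}_o)\mathbf{r}_i(\boldsymbol{\theta}_o)']. \tag{14.27}$$

When $E[\mathbf{r}_i(\boldsymbol{\theta}_o)] = \mathbf{0}$, as is almost always the case under assumption (14.22), $\boldsymbol{\Omega}_o$ is the
variance matrix of $\mathbf{r}_i(\boldsymbol{\theta}_o)$. As in Chapter 8, assumption (14.26) is a kind of system
homoskedasticity assumption.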

By iterated expectations, a sufficient condition for assumption (14.26) is

$$E[\mathbf{r}_i(\boldsymbol{\theta}_o)\mathbf{r}_i(\boldsymbol{\theta}_o)' \mid \mathbf{Z}_i] = \boldsymbol{\Omega}_o. \tag{14.28}$$

However, assumption (14.26) can hold in cases where assumption (14.28) does not.
If assumption (14.26) holds, then $\boldsymbol{\Lambda}_o$ can be estimated as

$$\hat{\boldsymbol{\Lambda}} = N^{-1}\sum_{i=1}^N \mathbf{Z}_i'\hat{\boldsymbol{\Omega}}\mathbf{Z}_i, \tag{14.29}$$

where

$$\hat{\boldsymbol{\Omega}} = N^{-1}\sum_{i=1}^N \mathbf{r}_i(\hat{\hat{\boldsymbol{\theta}}})\mathbf{r}_i(\hat{\hat{\boldsymbol{\theta}}})' \tag{14.30}$$

and $\hat{\hat{\boldsymbol{\theta}}}$ is a preliminary estimator. The resulting GMM estimator is usually called the
nonlinear 3SLS (N3SLS) estimator. The name is a holdover from the traditional
3SLS estimator in linear systems of equations; there are not really three estimation
steps. We should remember that nonlinear 3SLS is generally inefficient when assumption
(14.26) fails.
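
The two estimators of $\boldsymbol{\Lambda}_o$ can be compared on simulated data. The sketch below assumes residuals that are drawn independently of the instruments, so that (14.28) holds by construction; the array layout (`Z` with shape `(N, G, L)`, `r` with shape `(N, G)`) and the function names are choices made here for illustration.

```python
import numpy as np

def lambda_robust(Z, r):
    """As in (14.25): N^{-1} sum_i Z_i' r_i r_i' Z_i.
    Z has shape (N, G, L); r has shape (N, G) and holds preliminary residuals."""
    Zr = np.einsum('igl,ig->il', Z, r)     # row i is (Z_i' r_i)'
    return Zr.T @ Zr / Z.shape[0]

def lambda_3sls(Z, r):
    """As in (14.29)-(14.30): N^{-1} sum_i Z_i' Omega-hat Z_i,
    with Omega-hat = N^{-1} sum_i r_i r_i'."""
    N = Z.shape[0]
    Omega = r.T @ r / N
    return np.einsum('igl,gh,ihm->lm', Z, Omega, Z) / N

rng = np.random.default_rng(3)
N, G, L = 20000, 2, 3
Z = rng.normal(size=(N, G, L))
# Residuals independent of Z with variance matrix C C', so (14.28) holds.
C = np.array([[1.0, 0.0], [0.6, 0.8]])
r = rng.normal(size=(N, G)) @ C.T
Lam_robust = lambda_robust(Z, r)
Lam_3sls = lambda_3sls(Z, r)
print(np.round(Lam_robust, 2))
print(np.round(Lam_3sls, 2))
```

In this design the two matrices should be close, since both are consistent for the same limit. If the residual variance depended on $\mathbf{Z}_i$, only the estimator in (14.25) would remain consistent for $\boldsymbol{\Lambda}_o$.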

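The Wald statistic and the QLR statistic can be computed as in Section 14.1. In
addition, a score statistic is sometimes useful. Let $\tilde{\tilde{\boldsymbol{\theta}}}$ be a preliminary inefficient estimator
with $Q$ restrictions imposed. The estimator $\tilde{\tilde{\boldsymbol{\theta}}}$ would usually come from problem
(14.24) subject to the restrictions $\mathbf{c}(\boldsymbol{\theta}) = \mathbf{0}$. Let $\tilde{\boldsymbol{\Lambda}}$ be the estimated weighting
matrix from equation (14.25) or (14.29), based on $\tilde{\tilde{\boldsymbol{\theta}}}$. Let $\tilde{\boldsymbol{\theta}}$ be the minimum chi-square
estimator using weighting matrix $\tilde{\boldsymbol{\Lambda}}^{-1}$. Then the score statistic is based on the
limiting distribution of the score of the unrestricted objective function evaluated at
the restricted estimates, properly standardized:

$$\left[N^{-1}\sum_{i=1}^N \mathbf{Z}_i'\nabla_\theta \mathbf{r}_i(\tilde{\boldsymbol{\theta}})\right]' \tilde{\boldsymbol{\Lambda}}^{-1} \left[N^{-1/2}\sum_{i=1}^N \mathbf{Z}_i'\mathbf{r}_i(\tilde{\boldsymbol{\theta}})\right]. \tag{14.31}$$

Let $\tilde{\mathbf{s}}_i \equiv \tilde{\mathbf{G}}'\tilde{\boldsymbol{\Lambda}}^{-1}\mathbf{Z}_i'\tilde{\mathbf{r}}_i$, where $\tilde{\mathbf{G}}$ is the first matrix in expression (14.31), and let $\mathbf{s}_i^o \equiv
\mathbf{G}_o'\boldsymbol{\Lambda}_o^{-1}\mathbf{Z}_i'\mathbf{r}_i^o$. Then, following the proof in Section 12.6.2, it can be shown that equation
(12.67) holds with $\mathbf{A}_o \equiv \mathbf{G}_o'\boldsymbol{\Lambda}_o^{-1}\mathbf{G}_o$. Further, since $\mathbf{B}_o = \mathbf{A}_o$ for the minimum chi-square
estimator, we obtain

$$LM = \left(\sum_{i=1}^N \tilde{\mathbf{s}}_i\right)' \tilde{\mathbf{A}}^{-1} \left(\sum_{i=1}^N \tilde{\mathbf{s}}_i\right) \Big/ N, \tag{14.32}$$

where $\tilde{\mathbf{A}} = \tilde{\mathbf{G}}'\tilde{\boldsymbol{\Lambda}}^{-1}\tilde{\mathbf{G}}$. Under $H_0$ and the usual regularity conditions, $LM$ has a limiting
$\chi^2_Q$ distribution.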


14.3 Systems of Nonlinear Equations 


A leading application of the results in Section 14.2 is to estimation of the parameters 
in an implicit set of nonlinear equations, such as a nonlinear simultaneous equations 
model. Partition $w_i$ as $y_i \in \mathbb{R}^J$, $x_i \in \mathbb{R}^K$ and, for $h = 1, \ldots, G$, suppose we have

$$\begin{aligned} q_1(y_i, x_i, \theta_{o1}) &= u_{i1}, \\ &\;\;\vdots \\ q_G(y_i, x_i, \theta_{oG}) &= u_{iG}, \end{aligned} \tag{14.33}$$

where $\theta_{oh}$ is a $P_h \times 1$ vector of parameters. As an example, write a two-equation
SEM in the population as

$$y_1 = x_1 \delta_1 + \gamma_1 y_2^{\gamma_2} + u_1, \tag{14.34}$$

$$y_2 = x_2 \delta_2 + \gamma_3 y_1 + u_2 \tag{14.35}$$

(where we drop the "o" subscript on the parameters). This model, unlike those covered in
Section 9.5, is nonlinear in the parameters as well as the endogenous variables.
Nevertheless, assuming that $E(u_g \mid x) = 0$, $g = 1, 2$, the parameters in the system can be
estimated by GMM by defining $q_1(y, x, \theta_1) = y_1 - x_1 \delta_1 - \gamma_1 y_2^{\gamma_2}$ and
$q_2(y, x, \theta_2) = y_2 - x_2 \delta_2 - \gamma_3 y_1$.

Generally, the equations (14.33) need not actually determine $y_i$ given the exogenous
variables and disturbances; in fact, nothing requires $J = G$. Sometimes equations
(14.33) represent a system of orthogonality conditions of the form
$E[q_g(y, x, \theta_{og}) \mid x] = 0$, $g = 1, \ldots, G$. We will see an example later.

Denote the $P \times 1$ vector of all parameters by $\theta_o$, and the parameter space by
$\Theta \subset \mathbb{R}^P$. To identify the parameters, we need the errors $u_{ih}$ to satisfy some orthogonality
conditions. A general assumption is, for some subvector $x_{ih}$ of $x_i$,

$$E(u_{ih} \mid x_{ih}) = 0, \qquad h = 1, 2, \ldots, G. \tag{14.36}$$

This allows elements of $x_i$ to be correlated with some errors, a situation that sometimes
arises in practice (see, for example, Chapter 9 and Wooldridge, 1996). Under
assumption (14.36), let $z_{ih} \equiv f_h(x_{ih})$ be a $1 \times L_h$ vector of possibly nonlinear functions
of $x_i$. If there are no restrictions on the $\theta_{oh}$ across equations, we should have
$L_h \geq P_h$ so that each $\theta_{oh}$ is identified. By iterated expectations, for all $h = 1, \ldots, G$,

$$E(z_{ih}' u_{ih}) = 0, \tag{14.37}$$

provided appropriate moments exist. Therefore, we obtain a set of orthogonality
conditions by defining the $G \times L$ matrix $Z_i$ as the block diagonal matrix with $z_{ig}$ in
the $g$th block:

$$Z_i \equiv \begin{pmatrix} z_{i1} & 0 & \cdots & 0 \\ 0 & z_{i2} & \cdots & 0 \\ \vdots & & \ddots & \vdots \\ 0 & 0 & \cdots & z_{iG} \end{pmatrix}, \tag{14.38}$$


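where $L \equiv L_1 + L_2 + \cdots + L_G$. Letting $r(w_i, \theta) \equiv q(y_i, x_i, \theta) \equiv [q_{i1}(\theta_1), \ldots, q_{iG}(\theta_G)]'$,
equation (14.22) holds under assumption (14.36).

When there are no restrictions on the $\theta_h$ across equations and $Z_i$ is chosen as in
matrix (14.38), the system 2SLS estimator reduces to the nonlinear 2SLS (N2SLS)
estimator (Amemiya, 1974) equation by equation. That is, for each $h$, the N2SLS
estimator solves

$$\min_{\theta_h} \left[ \sum_{i=1}^{N} z_{ih}' q_{ih}(\theta_h) \right]' \left( N^{-1} \sum_{i=1}^{N} z_{ih}' z_{ih} \right)^{-1} \left[ \sum_{i=1}^{N} z_{ih}' q_{ih}(\theta_h) \right]. \tag{14.39}$$

Given only the orthogonality conditions (14.37), the N2SLS estimator is the efficient
estimator of $\theta_{oh}$ if

$$E(u_{ih}^2 z_{ih}' z_{ih}) = \sigma_{oh}^2 E(z_{ih}' z_{ih}), \tag{14.40}$$

where $\sigma_{oh}^2 \equiv E(u_{ih}^2)$; sufficient for condition (14.40) is $E(u_{ih}^2 \mid z_{ih}) = \sigma_{oh}^2$. Let $\hat{\hat{\theta}}_h$ denote
the N2SLS estimator. Then a consistent estimator of $\sigma_{oh}^2$ is

$$\hat{\sigma}_h^2 \equiv N^{-1} \sum_{i=1}^{N} \hat{\hat{u}}_{ih}^2, \tag{14.41}$$

where $\hat{\hat{u}}_{ih} \equiv q_h(y_i, x_i, \hat{\hat{\theta}}_h)$ are the N2SLS residuals. Under assumptions (14.37) and
(14.40), the asymptotic variance of $\hat{\hat{\theta}}_h$ is estimated as

$$\hat{\sigma}_h^2 \left\{ \left[ \sum_{i=1}^{N} z_{ih}' \nabla_{\theta_h} q_{ih}(\hat{\hat{\theta}}_h) \right]' \left( \sum_{i=1}^{N} z_{ih}' z_{ih} \right)^{-1} \left[ \sum_{i=1}^{N} z_{ih}' \nabla_{\theta_h} q_{ih}(\hat{\hat{\theta}}_h) \right] \right\}^{-1}, \tag{14.42}$$

where $\nabla_{\theta_h} q_{ih}(\theta_h)$ is the $1 \times P_h$ gradient.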
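
To make the single-equation N2SLS calculation concrete, the following sketch minimizes the objective in (14.39) by Gauss-Newton iterations and then evaluates the variance estimator (14.42). It is only an illustration under stated assumptions: the function names `n2sls` and `n2sls_avar` are hypothetical, the user supplies the residual function and its gradient as callables, and Gauss-Newton is one convenient numerical method, not something prescribed by the text.

```python
import numpy as np

def n2sls(q, grad_q, Z, theta0, max_iter=100, tol=1e-10):
    """Gauss-Newton iterations for the N2SLS objective (14.39).

    q(theta)      -> length-N vector of residuals q_ih(theta)
    grad_q(theta) -> N x P matrix whose rows are the 1 x P gradients
    Z             -> N x L matrix whose rows are the instruments z_ih
    """
    theta = np.asarray(theta0, dtype=float)
    W = np.linalg.inv(Z.T @ Z / Z.shape[0])  # weighting matrix in (14.39)
    for _ in range(max_iter):
        m = Z.T @ q(theta)        # L-vector: sum_i z_i' q_i(theta)
        D = Z.T @ grad_q(theta)   # L x P:    sum_i z_i' grad q_i(theta)
        step = np.linalg.solve(D.T @ W @ D, D.T @ W @ m)
        theta = theta - step
        if np.max(np.abs(step)) < tol:
            break
    return theta

def n2sls_avar(q, grad_q, Z, theta):
    """Variance estimator (14.42); valid only under (14.37) and (14.40)."""
    u = q(theta)
    sigma2 = u @ u / len(u)       # equation (14.41)
    D = Z.T @ grad_q(theta)
    return sigma2 * np.linalg.inv(D.T @ np.linalg.solve(Z.T @ Z, D))
```

When the residual function is linear in the parameters, a single Gauss-Newton step reproduces the usual linear 2SLS estimator, which gives a simple check on the code.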

If assumption (14.37) holds but assumption (14.40) does not, the N2SLS estimator
is still $\sqrt{N}$-consistent, but it is not the efficient estimator that uses the orthogonality
condition (14.37) whenever $L_h > P_h$ (and expression (14.42) is no longer valid). A
more efficient estimator is obtained by solving

$$\min_{\theta_h} \left[ \sum_{i=1}^{N} z_{ih}' q_{ih}(\theta_h) \right]' \left( N^{-1} \sum_{i=1}^{N} \hat{\hat{u}}_{ih}^2 z_{ih}' z_{ih} \right)^{-1} \left[ \sum_{i=1}^{N} z_{ih}' q_{ih}(\theta_h) \right]$$

with asymptotic variance estimated as

$$\left\{ \left[ \sum_{i=1}^{N} z_{ih}' \nabla_{\theta_h} q_{ih}(\hat{\theta}_h) \right]' \left( \sum_{i=1}^{N} \hat{\hat{u}}_{ih}^2 z_{ih}' z_{ih} \right)^{-1} \left[ \sum_{i=1}^{N} z_{ih}' \nabla_{\theta_h} q_{ih}(\hat{\theta}_h) \right] \right\}^{-1}.$$

This estimator is asymptotically equivalent to the N2SLS estimator if assumption
(14.40) happens to hold.

Rather than focus on one equation at a time, we can increase efficiency if we estimate
the equations simultaneously. One reason for doing so is to impose cross
equation restrictions on the $\theta_{oh}$. The system 2SLS estimator can be used for these
purposes, where $Z_i$ generally has the form (14.38). But this estimator does not exploit
correlation in the errors $u_{ig}$ and $u_{ih}$ in different equations.

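The efficient estimator that uses all orthogonality conditions in equation (14.37) is
just the GMM estimator with $\hat{\Lambda}$ given by equation (14.25), where $r_i(\hat{\hat{\theta}})$ is the $G \times 1$
vector of system 2SLS residuals, $\hat{\hat{u}}_i$. In other words, the efficient GMM estimator
solves

$$\min_{\theta} \left[ \sum_{i=1}^{N} Z_i' q_i(\theta) \right]' \left( N^{-1} \sum_{i=1}^{N} Z_i' \hat{\hat{u}}_i \hat{\hat{u}}_i' Z_i \right)^{-1} \left[ \sum_{i=1}^{N} Z_i' q_i(\theta) \right]. \tag{14.43}$$

The asymptotic variance of $\hat{\theta}$ is estimated as

$$\left\{ \left[ \sum_{i=1}^{N} Z_i' \nabla_\theta q_i(\hat{\theta}) \right]' \left( \sum_{i=1}^{N} Z_i' \hat{\hat{u}}_i \hat{\hat{u}}_i' Z_i \right)^{-1} \left[ \sum_{i=1}^{N} Z_i' \nabla_\theta q_i(\hat{\theta}) \right] \right\}^{-1}.$$

Because this is the efficient GMM estimator, the QLR statistic can be used to test
hypotheses about $\theta_o$. The Wald statistic can also be applied.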

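Under the homoskedasticity assumption (14.26) with $r_i(\theta_o) = u_i$, the nonlinear
3SLS estimator, which solves

$$\min_{\theta} \left[ \sum_{i=1}^{N} Z_i' q_i(\theta) \right]' \left( N^{-1} \sum_{i=1}^{N} Z_i' \hat{\Omega} Z_i \right)^{-1} \left[ \sum_{i=1}^{N} Z_i' q_i(\theta) \right],$$

is efficient, and its asymptotic variance is estimated as

$$\left\{ \left[ \sum_{i=1}^{N} Z_i' \nabla_\theta q_i(\hat{\theta}) \right]' \left( \sum_{i=1}^{N} Z_i' \hat{\Omega} Z_i \right)^{-1} \left[ \sum_{i=1}^{N} Z_i' \nabla_\theta q_i(\hat{\theta}) \right] \right\}^{-1}.$$

The N3SLS estimator is used widely for systems of the form (14.33), but, as we
discussed in Section 9.6, there are many cases where assumption (14.26) must fail when
different instruments are needed for different equations.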

As an example, we show how a hedonic price system fits into this framework.
Consider a linear demand and supply system for $G$ attributes of a good or service (see
Epple, 1987; Kahn and Lang, 1988; and Wooldridge, 1996). The demand and supply
system is written as

$$\text{demand}_g = \eta_{1g} + w \alpha_{1g} + x_1 \beta_{1g} + u_{1g}, \qquad g = 1, \ldots, G,$$
$$\text{supply}_g = \eta_{2g} + w \alpha_{2g} + x_2 \beta_{2g} + u_{2g}, \qquad g = 1, \ldots, G,$$

where $w = (w_1, \ldots, w_G)$ is the $1 \times G$ vector of attribute prices. The demand equations
usually represent an individual or household; the supply equations can represent an
individual, firm, or employer.

There are several tricky issues in estimating either the demand or supply function
for a particular $g$. First, the attribute prices $w_g$ are not directly observed. What is
usually observed are the equilibrium quantities for each attribute and each cross
section unit $i$; call these $q_{ig}$, $g = 1, \ldots, G$. (In the hedonic systems literature these are
often denoted $z_{ig}$, but we use $q_{ig}$ here because they are endogenous variables, and we
have been using $z_i$ to denote exogenous variables.) For example, the $q_{ig}$ can be features
of a house, such as size, number of bathrooms, and so on. Along with these
features we observe the equilibrium price of the good, $p_i$, which we assume follows a
quadratic hedonic price function:

$$p_i = \gamma + q_i \psi + q_i \Pi q_i'/2 + x_{i3} \delta + x_{i3} \Gamma q_i' + u_{i3}, \tag{14.44}$$

where $x_{i3}$ is a vector of variables that affect $p_i$, $\Pi$ is a $G \times G$ symmetric matrix, and $\Gamma$
is a $G \times G$ matrix.

A key point for identifying the demand and supply functions is that $w_i = \partial p_i / \partial q_i$,
which, under equation (14.44), becomes $w_i = q_i \Pi + x_{i3} \Gamma$, or $w_{ig} = q_i \pi_g + x_{i3} \gamma_g$ for
each $g$. By substitution, the equilibrium estimating equations can be written as
equation (14.44) plus

$$q_{ig} = \eta_{1g} + (q_i \Pi + x_{i3} \Gamma) \alpha_{1g} + x_{i1} \beta_{1g} + u_{i1g}, \qquad g = 1, \ldots, G, \tag{14.45}$$

$$q_{ig} = \eta_{2g} + (q_i \Pi + x_{i3} \Gamma) \alpha_{2g} + x_{i2} \beta_{2g} + u_{i2g}, \qquad g = 1, \ldots, G. \tag{14.46}$$

These two equations are linear in $q_i$, $x_{i1}$, $x_{i2}$, and $x_{i3}$ but nonlinear in the parameters.

Let $u_{i1}$ be the $G \times 1$ vector of attribute demand disturbances and $u_{i2}$ the $G \times 1$
vector of attribute supply disturbances. What are reasonable assumptions about
$u_{i1}$, $u_{i2}$, and $u_{i3}$? It is almost always assumed that equation (14.44) represents a
conditional expectation with no important unobserved factors; this assumption means
$E(u_{i3} \mid q_i, x_i) = 0$, where $x_i$ contains all elements in $x_{i1}$, $x_{i2}$, and $x_{i3}$. The properties of
$u_{i1}$ and $u_{i2}$ are more subtle. It is clear that these cannot be uncorrelated with $q_i$, and
so equations (14.45) and (14.46) contain endogenous explanatory variables if $\Pi \neq 0$.
But there is another problem, pointed out by Bartik (1987), Epple (1987), and Kahn
and Lang (1988): because of matching that happens between individual buyers and
sellers, $x_{i2}$ is correlated with $u_{i1}$ and $x_{i1}$ is correlated with $u_{i2}$. Consequently, what
would seem to be the obvious IVs for the demand equations (14.45), namely the factors
shifting the supply curve, are endogenous to equation (14.45). Fortunately, all is not
lost: if $x_{i3}$ contains exogenous factors that affect $p_i$ but do not appear in the structural
demand and supply functions, we can use these as instruments in both the demand
and supply equations. Specifically, we assume

$$E(u_{i1} \mid x_{i1}, x_{i3}) = 0, \qquad E(u_{i2} \mid x_{i2}, x_{i3}) = 0, \qquad E(u_{i3} \mid q_i, x_i) = 0. \tag{14.47}$$

Common choices for $x_{i3}$ are geographical or industry dummy indicators (for example,
Montgomery, Shaw, and Benedict, 1992; Hagy, 1998), where the assumption is
that the demand and supply functions do not change across region or industry but the
type of matching does, and therefore $p_i$ can differ systematically across region or
industry. Bartik (1987) discusses how a randomized experiment can be used to create
the elements of $x_{i3}$.

For concreteness, let us focus on estimating the set of demand functions. If $\Pi = 0$,
so that the quadratic in $q_i$ does not appear in equation (14.44), a simple two-step
procedure is available: (1) estimate equation (14.44) by OLS, and obtain
$\hat{w}_{ig} = \hat{\psi}_g + x_{i3} \hat{\gamma}_g$ for each $i$ and $g$; (2) run the regression $q_{ig}$ on $1$, $\hat{w}_i$, $x_{i1}$, $i = 1, \ldots, N$. Under
assumptions (14.47) and identification assumptions, this method produces
$\sqrt{N}$-consistent, asymptotically normal estimators of the parameters in demand equation
$g$. Because the second regression involves generated regressors, the standard errors
and test statistics should be adjusted.

It is clear that, without restrictions on $\alpha_{1g}$, the order condition necessary for
identifying the demand parameters is that the dimension of $x_{i3}$, say $K_3$, must exceed $G$. If
$K_3 < G$, then $E[(w_i, x_{i1})'(w_i, x_{i1})]$ has less than full rank, and the OLS rank condition
fails. If we make exclusion restrictions on $\alpha_{1g}$, fewer elements are needed in $x_{i3}$. In the
case that only $w_{ig}$ appears in the demand equation for attribute $g$, $x_{i3}$ can be a scalar,
provided its interaction with $q_{ig}$ in the hedonic price system is significant ($\gamma_g \neq 0$).
Checking the analogue of the rank condition in general is somewhat complicated; see
Epple (1987) for discussion.

When $w_i = q_i \Pi + x_{i3} \Gamma$, $w_i$ is correlated with $u_{i1g}$, so we must modify the two-step
procedure. In the second step, we can use instruments for $\hat{w}_i$ and perform 2SLS rather
than OLS. Assuming that $x_{i3}$ has enough elements, the demand equations are still
identified. If only $w_{ig}$ appears in $\text{demand}_{ig}$, sufficient for identification is that an
element of $x_{i3}$ appears in the linear projection of $w_{ig}$ on $x_{i1}$, $x_{i3}$. This assumption can
hold even if $x_{i3}$ has only a single element. For the matching reasons we discussed
previously, $x_{i2}$ cannot be used as instruments for $\hat{w}_i$ in the demand equation.

Whether $\Pi = 0$ or not, more efficient estimators are obtained from the full demand
system and the hedonic price function. Write

$$q_i = \eta_1 + (q_i \Pi + x_{i3} \Gamma) A_1 + x_{i1} B_1 + u_{i1}$$

along with equation (14.44). Then $(x_{i1}, x_{i3})$ (and functions of these) can be used as
instruments in any of the $G$ demand equations, and $(q_i, x_i)$ act as IVs in equation
(14.44). (It may be that the supply function is not even specified, in which case $x_i$
contains only $x_{i1}$ and $x_{i3}$.) A first-stage estimator is the nonlinear system 2SLS
estimator. Then the system can be estimated by the minimum chi-square estimator that
solves problem (14.43). When restricting attention to demand equations plus the
hedonic price equation, or supply equations plus the hedonic price equation, nonlinear
3SLS is efficient under certain assumptions. If the demand and supply equations are
estimated together, the key assumption (14.26) that makes nonlinear 3SLS
asymptotically efficient cannot be expected to hold; see Wooldridge (1996) for discussion.

If one of the demand functions is of primary interest, it may make sense to estimate 
it along with equation (14.44), by GMM or nonlinear 3SLS. If the demand functions 
are written in inverse form, the resulting system is linear in the parameters, as shown 
in Wooldridge (1996). 


14.4 Efficient Estimation 


In Chapter 8 we obtained the efficient weighting matrix for GMM estimation of 
linear models, and we extended that to nonlinear models in Section 14.1. In Chapter 
13 we asserted that maximum likelihood estimation has some important efficiency 
properties. We are now in a position to study a framework that allows us to show the 
efficiency of an estimator within a particular class of estimators, and also to find 
efficient estimators within a stated class. Our approach is essentially that in Newey 
and McFadden (1994, Sect. 5.3), although we will not use the weakest possible 
assumptions. Bates and White (1993) proposed a very similar framework and also 
considered time series problems. 


14.4.1 General Efficiency Framework 


Most estimators in econometrics, including all of the ones we have studied, are
$\sqrt{N}$-asymptotically normal, with variance matrices of the form

$$V = A^{-1} E[s(w) s(w)'] (A')^{-1}, \tag{14.48}$$

where, in most cases, $s(w)$ is the score of an objective function (evaluated at $\theta_o$) and
$A$ is the expected value of the Jacobian of the score, again evaluated at $\theta_o$. (We
suppress an "o" subscript here, as the value of the true parameter is irrelevant.) All
M-estimators with twice continuously differentiable objective functions (and even
some without) have variance matrices of this form, as do GMM estimators. The
following lemma is a useful sufficient condition for showing that one estimator is more
efficient than another.


LEMMA 14.1 (Relative Efficiency): Let $\hat{\theta}_1$ and $\hat{\theta}_2$ be two $\sqrt{N}$-asymptotically normal
estimators of the $P \times 1$ parameter vector $\theta_o$, with asymptotic variances of the form
(14.48) (with appropriate subscripts on $A$, $s$, and $V$). If for some $\rho > 0$,

$$E[s_1(w) s_1(w)'] = \rho A_1, \tag{14.49}$$
$$E[s_2(w) s_1(w)'] = \rho A_2, \tag{14.50}$$
then $V_2 - V_1$ is p.s.d.


The proof of Lemma 14.1 is given in the chapter appendix. 

Condition (14.49) is essentially the generalized information matrix equality (GIME)
we introduced in Section 12.5.1 for the estimator $\hat{\theta}_1$. Notice that $A_1$ is necessarily
symmetric and positive definite under condition (14.49). Condition (14.50) is new. In
most cases, it says that the expected outer product of the scores $s_2$ and $s_1$ equals the
expected Jacobian of $s_2$ (evaluated at $\theta_o$). In Section 12.5.1 we claimed that the
GIME plays a role in efficiency, and Lemma 14.1 shows that it does so.

Verifying the conditions of Lemma 14.1 is also very convenient for constructing
simple forms of the Hausman (1978) statistic in a variety of contexts. Provided
that the two estimators are jointly asymptotically normally distributed (something
that is almost always true when each is $\sqrt{N}$-asymptotically normal, and that can
be verified by stacking the first-order representations of the estimators), assumptions
(14.49) and (14.50) imply that the asymptotic covariance between $\sqrt{N}(\hat{\theta}_2 - \theta_o)$ and
$\sqrt{N}(\hat{\theta}_1 - \theta_o)$ is $A_2^{-1} E(s_2 s_1') A_1^{-1} = A_2^{-1} (\rho A_2) A_1^{-1} = \rho A_1^{-1} = \mathrm{Avar}[\sqrt{N}(\hat{\theta}_1 - \theta_o)]$. In
other words, the asymptotic covariance between the ($\sqrt{N}$-scaled) estimators is equal
to the asymptotic variance of the efficient estimator. This equality implies that
$\mathrm{Avar}[\sqrt{N}(\hat{\theta}_2 - \hat{\theta}_1)] = V_2 + V_1 - C - C' = V_2 + V_1 - 2V_1 = V_2 - V_1$, where $C$ is the
asymptotic covariance. If $V_2 - V_1$ is actually positive definite (rather than just
p.s.d.), then $[\sqrt{N}(\hat{\theta}_2 - \hat{\theta}_1)]' (\hat{V}_2 - \hat{V}_1)^{-1} [\sqrt{N}(\hat{\theta}_2 - \hat{\theta}_1)] \overset{a}{\sim} \chi^2_P$ under the assumptions
of Lemma 14.1, where $\hat{V}_g$ is a consistent estimator of $V_g$, $g = 1, 2$. Statistically
significant differences between $\hat{\theta}_2$ and $\hat{\theta}_1$ signal some sort of model misspecification.
(See Section 6.2.1, where we discussed this form of the Hausman test for comparing
2SLS and OLS to test whether the explanatory variables are exogenous.) If assumptions
(14.49) and (14.50) do not hold, this standard form of the Hausman statistic is
invalid.
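
The simple form of the Hausman statistic described above is easy to compute once both estimators and their estimated variance matrices are available. The sketch below is illustrative only: the function name `hausman_statistic` is hypothetical, and the inputs are assumed to be the estimated asymptotic variances of the estimators themselves (that is, $\hat{V}_g / N$), so the factor $N$ cancels.

```python
import numpy as np

def hausman_statistic(theta_eff, theta_alt, avar_eff, avar_alt):
    """Simple Hausman form: d' (Avar_alt - Avar_eff)^{-1} d.

    theta_eff, avar_eff: the estimator satisfying (14.49) and its Avar estimate.
    theta_alt, avar_alt: the less efficient estimator and its Avar estimate.
    Assumes avar_alt - avar_eff is positive definite, as required in the text.
    """
    d = np.asarray(theta_alt, dtype=float) - np.asarray(theta_eff, dtype=float)
    v_diff = np.asarray(avar_alt, dtype=float) - np.asarray(avar_eff, dtype=float)
    return float(d @ np.linalg.solve(v_diff, d))
```

Under the conditions of Lemma 14.1 the returned value would be compared with a chi-square critical value with $P$ degrees of freedom. In finite samples the estimated variance difference need not be positive definite, in which case this simple form should not be used as is.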

Given Lemma 14.1, we can state a condition that implies efficiency of an estimator
in an entire class of estimators. It is useful to be somewhat formal in defining the
relevant class of estimators. We do so by introducing an index, $\tau$. For each $\tau$ in an
index set, say, $\mathcal{T}$, the estimator $\hat{\theta}_\tau$ has an associated $s_\tau$ and $A_\tau$ such that the
asymptotic variance of $\sqrt{N}(\hat{\theta}_\tau - \theta_o)$ has the form (14.48). The index can be very abstract; it
simply serves to distinguish different $\sqrt{N}$-asymptotically normal estimators of $\theta_o$. For
example, in the class of M-estimators, the set $\mathcal{T}$ consists of objective functions $q(\cdot, \cdot)$
such that $\theta_o$ uniquely minimizes $E[q(w, \theta)]$ over $\Theta$, and $q$ satisfies the twice
continuously differentiable and bounded moment assumptions imposed for asymptotic
normality. For GMM with given moment conditions, $\mathcal{T}$ is the set of all $L \times L$ positive
definite matrices. We will see another example in Section 14.4.3. Lemma 14.1
immediately implies the following theorem.


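THEOREM 14.3 (Efficiency in a Class of Estimators): Let $\{\hat{\theta}_\tau : \tau \in \mathcal{T}\}$ be a class of
$\sqrt{N}$-asymptotically normal estimators with variance matrices of the form (14.48). If
for some $\tau^* \in \mathcal{T}$ and $\rho > 0$

$$E[s_\tau(w) s_{\tau^*}(w)'] = \rho A_\tau, \qquad \text{all } \tau \in \mathcal{T}, \tag{14.51}$$
then $\hat{\theta}_{\tau^*}$ is asymptotically relatively efficient in the class $\{\hat{\theta}_\tau : \tau \in \mathcal{T}\}$.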


This theorem has many applications. If we specify a class of estimators by defining
the index set $\mathcal{T}$, then the estimator $\hat{\theta}_{\tau^*}$ is more efficient than all other estimators in the
class if we can show condition (14.51). (A partial converse to Theorem 14.3 also
holds; see Newey and McFadden (1994, Section 5.3).) This is not to say that $\hat{\theta}_{\tau^*}$ is
necessarily more efficient than all possible $\sqrt{N}$-asymptotically normal estimators. If
there is an estimator that falls outside the specified class, then Theorem 14.3 does not
help us to compare it with $\hat{\theta}_{\tau^*}$. In this sense, Theorem 14.3 is a more general (and
asymptotic) version of the Gauss-Markov theorem from linear regression analysis:
while the Gauss-Markov theorem states that OLS has the smallest variance in the
class of linear, unbiased estimators, it does not allow us to compare OLS to unbiased
estimators that are not linear in the vector of observations on the dependent variable.
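
As a hedged illustration of how condition (14.51) can be checked, consider the GMM class mentioned above, indexed by the weighting matrix. The symbols $W$ (a generic positive definite weighting matrix), $G_o \equiv E[Z_i' \nabla_\theta r_i(\theta_o)]$, and $\Lambda_o \equiv E[Z_i' r_i(\theta_o) r_i(\theta_o)' Z_i]$ are assumptions chosen to match the notation of Section 14.2, and the score and Jacobian forms for GMM are taken as given rather than derived here.

```latex
\begin{aligned}
s_W(w_i) &= G_o' W Z_i' r_i(\theta_o), \qquad A_W = G_o' W G_o, \\
s_{W^*}(w_i) &= G_o' \Lambda_o^{-1} Z_i' r_i(\theta_o) \quad \text{for } W^* = \Lambda_o^{-1}, \\
E[s_W(w_i) s_{W^*}(w_i)'] &= G_o' W \, E[Z_i' r_i(\theta_o) r_i(\theta_o)' Z_i] \, \Lambda_o^{-1} G_o \\
&= G_o' W \Lambda_o \Lambda_o^{-1} G_o = G_o' W G_o = A_W .
\end{aligned}
```

So condition (14.51) holds with $\rho = 1$ for every positive definite $W$, which is the familiar statement that $\Lambda_o^{-1}$ is an efficient weighting matrix for the given moment conditions.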


14.4.2 Efficiency of Maximum Likelihood Estimator 


Students of econometrics are often told that the maximum likelihood estimator is
"efficient." Unfortunately, in the context of conditional MLE (CMLE) from Chapter
13, the statement of efficiency is usually ambiguous; Manski (1988, Chap. 8) is a
notable exception. Theorem 14.3 allows us to state precisely the class of estimators in
which the CMLE is relatively efficient. As in Chapter 13, we let $E_\theta(\cdot \mid x)$ denote the
expectation with respect to the conditional density $f(y \mid x; \theta)$.

Consider the class of estimators solving the first-order condition

$$\sum_{i=1}^{N} g(w_i, \hat{\theta}) = 0, \tag{14.52}$$

where the $P \times 1$ function $g(w, \theta)$ is such that

$$E_\theta[g(w, \theta) \mid x] = 0, \qquad \text{all } x \in \mathcal{X}, \text{ all } \theta \in \Theta. \tag{14.53}$$

In other words, the class of estimators is indexed by functions $g$ satisfying a zero
conditional moment restriction. We assume the standard regularity conditions from
Chapter 12; in particular, $g(w, \cdot)$ is continuously differentiable on the interior of $\Theta$.

As we showed in Section 13.7, functions $g$ satisfying condition (14.53) generally
have the property

$$-E[\nabla_\theta g(w, \theta_o) \mid x] = E[g(w, \theta_o) s(w, \theta_o)' \mid x],$$

where $s(w, \theta)$ is the score of $\log f(y \mid x; \theta)$ (as always, we must impose certain
regularity conditions on $g$ and $\log f$). If we take the expectation of both sides with respect
to $x$, we obtain condition (14.51) with $\rho = 1$, $A_\tau = E[\nabla_\theta g(w, \theta_o)]$, and
$s_{\tau^*}(w) = -s(w, \theta_o)$. It follows from Theorem 14.3 that the conditional MLE is efficient in
the class of estimators solving equation (14.52), where $g(\cdot)$ satisfies condition (14.53)
and appropriate regularity conditions. Recall from Section 13.5.1 that the asymptotic
variance of the (centered and standardized) CMLE is $\{E[s(w, \theta_o) s(w, \theta_o)']\}^{-1}$.
This is an example of an efficiency bound because no estimator of the form
(14.52) under condition (14.53) can have an asymptotic variance smaller than
$\{E[s(w, \theta_o) s(w, \theta_o)']\}^{-1}$ (in the matrix sense). When an estimator from this class has
the same asymptotic variance as the CMLE, we say it achieves the efficiency bound.

It is important to see that the efficiency of the CMLE in the class of estimators
solving equation (14.52) under condition (14.53) does not require $x$ to be ancillary for
$\theta_o$: except for regularity conditions, the distribution of $x$ is essentially unrestricted,
and could depend on $\theta_o$. CMLE simply ignores information on $\theta_o$ that might be
contained in the distribution of $x$, but so do all other estimators that are based on
condition (14.53).

By choosing $x$ to be empty, we conclude that the unconditional MLE is efficient in
the class of estimators based on equation (14.52) with $E_\theta[g(w, \theta)] = 0$, all $\theta \in \Theta$. This
is a very broad class of estimators, including all of the estimators requiring condition
(14.53): if a function $g$ satisfies condition (14.53), it has zero unconditional mean, too.
Consequently, the unconditional MLE is generally more efficient than the conditional
MLE. This efficiency comes at the price of having to model the joint density of
$(y, x)$, rather than just the conditional density of $y$ given $x$. And, if our model for the
density of $x$ is incorrect, the unconditional MLE generally would be inconsistent.

When is CMLE as efficient as unconditional MLE for estimating $\theta_o$? Assume that
the model for the joint density of $(x, y)$ can be expressed as $f(y \mid x; \theta) h(x; \delta)$, where $\theta$
is the parameter vector of interest, and $h(x; \delta_o)$ is the marginal density of $x$ for some
vector $\delta_o$. Then, if $\delta$ does not depend on $\theta$ in the sense that $\nabla_\theta h(x; \delta) = 0$ for all $x$ and
$\delta$, $x$ is ancillary for $\theta_o$. In fact, the CMLE is identical to the unconditional MLE. If $\delta$
depends on $\theta$, the term $\nabla_\theta \log[h(x; \delta)]$ generally contains information for estimating
$\theta_o$, and unconditional MLE will be more efficient than CMLE.
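
The reason the two estimators coincide in the ancillary case can be seen from the factorization of the log likelihood. The display below is a short worked step added for illustration; it uses only the factorization of the joint density stated above.

```latex
\begin{aligned}
\log[f(y_i \mid x_i; \theta)\, h(x_i; \delta)] &= \log f(y_i \mid x_i; \theta) + \log h(x_i; \delta), \\
\nabla_\theta \log[f(y_i \mid x_i; \theta)\, h(x_i; \delta)] &= \nabla_\theta \log f(y_i \mid x_i; \theta)
\quad \text{when } \nabla_\theta h(x_i; \delta) = 0 .
\end{aligned}
```

When the second line holds, the joint and conditional log likelihoods have the same score with respect to $\theta$, so maximizing either one over $\theta$ gives the same estimator.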


14.4.3 Efficient Choice of Instruments under Conditional Moment Restrictions 


We can also apply Theorem 14.3 to find the optimal set of instrumental variables
under general conditional moment restrictions. For a $G \times 1$ vector $r(w_i, \theta)$, where
$w_i \in \mathbb{R}^M$, $\theta_o$ is said to satisfy conditional moment restrictions if
$$E[r(w_i, \theta_o) \mid x_i] = 0, \tag{14.54}$$
where $x_i \in \mathbb{R}^K$ is a subvector of $w_i$. Under assumption (14.54), the matrix $Z_i$
appearing in equation (14.22) can be any function of $x_i$. For a given matrix $Z_i$, we
obtain the efficient GMM estimator by using the efficient weighting matrix. However,
unless $Z_i$ is the optimal set of instruments, we can generally obtain a more efficient
estimator by adding any nonlinear function of $x_i$ to $Z_i$. Because the list of potential
IVs is endless, it is useful to characterize the optimal choice of $Z_i$.

The solution to this problem is now pretty well known, and it can be obtained by
applying Theorem 14.3. Let

$$\Omega_o(x_i) \equiv \mathrm{Var}[r(w_i, \theta_o) \mid x_i] \tag{14.55}$$
be the $G \times G$ conditional variance of $r_i(\theta_o)$ given $x_i$, and define

$$R_o(x_i) \equiv E[\nabla_\theta r(w_i, \theta_o) \mid x_i]. \tag{14.56}$$
Problem 14.3 asks you to verify that the optimal choice of instruments is

$$Z^*(x_i) \equiv \Omega_o(x_i)^{-1} R_o(x_i). \tag{14.57}$$

The optimal instrument matrix is always $G \times P$, and so the efficient method of
moments estimator solves

$$\sum_{i=1}^{N} Z^*(x_i)' r_i(\hat{\theta}) = 0.$$

There is no need to use a weighting matrix. Incidentally, by taking $g(w, \theta) \equiv
Z^*(x)' r(w, \theta)$, we obtain a function $g$ satisfying condition (14.53). From our discussion
in Section 14.4.2, it follows immediately that the CMLE is no less efficient than
the optimal IV estimator.

In practice, $Z^*(x_i)$ is never a known function of $x_i$. In some cases the function
$R_o(x_i)$ is a known function of $x_i$ and $\theta_o$ and can be easily estimated; this statement is
true of linear SEMs under conditional mean assumptions (see Chapters 8 and 9) and
of multivariate nonlinear regression, which we cover later in this subsection. Rarely
do moment conditions imply a parametric form for $\Omega_o(x_i)$, but sometimes
homoskedasticity is assumed:

$$E[r_i(\theta_o) r_i(\theta_o)' \mid x_i] = \Omega_o, \tag{14.58}$$

and $\Omega_o$ is easily estimated as in equation (14.30), given a preliminary estimate of $\theta_o$.

Since both $\Omega_o(x_i)$ and $R_o(x_i)$ must be estimated, we must know the asymptotic
properties of GMM with generated instruments. Under conditional moment restrictions,
generated instruments have no effect on the asymptotic variance of the GMM
estimator. Thus, if the matrix of instruments is $Z(x_i, \gamma_o)$ for some unknown parameter
vector $\gamma_o$, and $\hat{\gamma}$ is an estimator such that $\sqrt{N}(\hat{\gamma} - \gamma_o) = O_p(1)$, then the GMM
estimator using the generated instruments $\hat{Z}_i \equiv Z(x_i, \hat{\gamma})$ has the same limiting
distribution as the GMM estimator using instruments $Z(x_i, \gamma_o)$ (using any weighting
matrix). This result follows from a mean value expansion, using the fact that the
derivative of each element of $Z(x_i, \gamma)$ with respect to $\gamma$ is orthogonal to $r_i(\theta_o)$ under
condition (14.54):

$$N^{-1/2} \sum_{i=1}^{N} \hat{Z}_i' r_i(\hat{\theta}) = N^{-1/2} \sum_{i=1}^{N} Z_i(\gamma_o)' r_i(\theta_o) + E[Z_i(\gamma_o)' R_o(x_i)] \sqrt{N}(\hat{\theta} - \theta_o) + o_p(1). \tag{14.59}$$

The right-hand side of equation (14.59) is identical to the expansion with $\hat{Z}_i$ replaced
with $Z_i(\gamma_o)$.

Assuming now that $Z_i(\gamma_o)$ is the matrix of efficient instruments, the asymptotic
variance of the efficient estimator is

$$\mathrm{Avar}\, \sqrt{N}(\hat{\theta} - \theta_o) = \{E[R_o(x_i)' \Omega_o(x_i)^{-1} R_o(x_i)]\}^{-1}, \tag{14.60}$$

as can be seen from Section 14.1 by noting that $G_o = E[R_o(x_i)' \Omega_o(x_i)^{-1} R_o(x_i)]$ and
$\Lambda_o = G_o$ when the instruments are given by equation (14.57), so that
$(G_o' \Lambda_o^{-1} G_o)^{-1} = G_o^{-1}$.

Equation (14.60) is another example of an efficiency bound, this time under the
conditional moment restrictions (14.54). What we have shown is that any GMM
estimator has variance matrix that differs from equation (14.60) by a p.s.d. matrix.
Chamberlain (1987) has shown more: any estimator that uses only condition (14.54)
and satisfies regularity conditions has variance matrix no smaller than equation
(14.60).

Estimation of $R_o(x_i)$ generally requires nonparametric methods. Newey (1990)
describes one approach. Essentially, regress the elements of $\nabla_\theta r_i(\hat{\hat{\theta}})$ on polynomial
functions of $x_i$ (or other functions with good approximating properties), where $\hat{\hat{\theta}}$ is
an initial estimate of $\theta_o$. The fitted values from these regressions can be used as the
elements of $\hat{R}_i$. Other nonparametric approaches are available. See Newey (1990,
1993) for details. Unfortunately, we need a fairly large sample size in order to apply
such methods effectively.

As an example of finding the optimal instruments, consider the problem of
estimating a conditional mean for a vector $y_i$:

$$E(y_i \mid x_i) = m(x_i, \theta_o). \tag{14.61}$$

Then the residual function is $r(w_i, \theta) \equiv y_i - m(x_i, \theta)$ and $\Omega_o(x_i) = \mathrm{Var}(y_i \mid x_i)$;
therefore, the optimal instruments are $Z_o(x_i) \equiv \Omega_o(x_i)^{-1} \nabla_\theta m(x_i, \theta_o)$. This is an
important example where $R_o(x_i) = -\nabla_\theta m(x_i, \theta_o)$ is a known function of $x_i$ and $\theta_o$;
the sign of the instruments does not affect the estimating equations. If
the homoskedasticity assumption

$$\mathrm{Var}(y_i \mid x_i) = \Omega_o \tag{14.62}$$

holds, then the efficient estimator is easy to obtain. First, let $\hat{\hat{\theta}}$ be the multivariate
nonlinear least squares (MNLS) estimator, which solves
$\min_{\theta \in \Theta} \sum_{i=1}^{N} [y_i - m(x_i, \theta)]' [y_i - m(x_i, \theta)]$. As discussed in Section 12.9, the MNLS estimator is generally
consistent and $\sqrt{N}$-asymptotically normal. Define the residuals $\hat{\hat{u}}_i \equiv y_i - m(x_i, \hat{\hat{\theta}})$, and
define a consistent estimator of $\Omega_o$ by $\hat{\Omega} = N^{-1} \sum_{i=1}^{N} \hat{\hat{u}}_i \hat{\hat{u}}_i'$. An efficient estimator, $\hat{\theta}$,
solves

$$\sum_{i=1}^{N} \nabla_\theta m(x_i, \hat{\hat{\theta}})' \hat{\Omega}^{-1} [y_i - m(x_i, \hat{\theta})] = 0$$

and the asymptotic variance of $\sqrt{N}(\hat{\theta} - \theta_o)$ is $\{E[\nabla_\theta m_i(\theta_o)' \Omega_o^{-1} \nabla_\theta m_i(\theta_o)]\}^{-1}$. An
asymptotically equivalent estimator is the nonlinear SUR estimator described in
Section 12.9. In either case, the estimator of $\mathrm{Avar}(\hat{\theta})$ under assumption (14.62) is

$$\widehat{\mathrm{Avar}}(\hat{\theta}) = \left[ \sum_{i=1}^{N} \nabla_\theta m_i(\hat{\theta})' \hat{\Omega}^{-1} \nabla_\theta m_i(\hat{\theta}) \right]^{-1}.$$

Because the nonlinear SUR estimator is a two-step M-estimator and $B_o = A_o$ (in the
notation of Chapter 12), the simplest forms of test statistics are valid. If assumption
(14.62) fails, the nonlinear SUR estimator is consistent, but robust inference should
be used because $A_o \neq B_o$. Moreover, the estimator is no longer efficient.
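
The two-step procedure under assumption (14.62) has a closed form in the special case where the conditional mean is linear in the parameters, $m(x_i, \theta) = X_i \theta$ with $X_i$ a $G \times P$ matrix, because the gradient is then $X_i$ and does not depend on $\theta$. The sketch below covers only that special case, which is an assumption made for the illustration; the function name `sur_two_step` and the array layout are likewise hypothetical.

```python
import numpy as np

def sur_two_step(X, y):
    """Two-step efficient estimator for E(y_i | x_i) = X_i theta under (14.62).

    X has shape (N, G, P); y has shape (N, G).
    Step 1: system least squares (MNLS in the linear case) and Omega_hat.
    Step 2: solve sum_i X_i' Omega_hat^{-1} (y_i - X_i theta) = 0.
    """
    N, G, P = X.shape
    theta_first = np.linalg.lstsq(X.reshape(N * G, P), y.reshape(N * G), rcond=None)[0]
    u = y - X @ theta_first               # N x G first-step residuals
    omega_hat = u.T @ u / N               # G x G estimate of Omega_o
    omega_inv = np.linalg.inv(omega_hat)
    A = np.einsum("ngp,gh,nhq->pq", X, omega_inv, X)  # sum_i X_i' Omega^{-1} X_i
    b = np.einsum("ngp,gh,nh->p", X, omega_inv, y)    # sum_i X_i' Omega^{-1} y_i
    theta_hat = np.linalg.solve(A, b)
    avar_hat = np.linalg.inv(A)           # estimated Avar(theta_hat) under (14.62)
    return theta_hat, avar_hat, omega_hat
```

With a single equation ($G = 1$) the weighting by `omega_inv` cancels and the second step reproduces ordinary least squares, which is a convenient check. For a genuinely nonlinear mean function the second step would require a numerical solver.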


14.5 Classical Minimum Distance Estimation 


A method that has features in common with GMM, and that often is a convenient
substitute, is classical minimum distance (CMD) estimation.

Suppose that the $P \times 1$ parameter vector of interest, $\theta_o$, which often consists of
parameters from a structural model, is known to be related to an $S \times 1$ vector of
reduced-form parameters, $\pi_o$, where $S \geq P$. In particular, $\pi_o = h(\theta_o)$ for a known,
continuously differentiable function $h: \mathbb{R}^P \to \mathbb{R}^S$, so that $h$ maps the structural
parameters into the reduced-form parameters.

CMD estimation of $\theta_o$ entails first estimating $\pi_o$ by $\hat{\pi}$, and then choosing an
estimator $\hat{\theta}$ of $\theta_o$ by making the distance between $\hat{\pi}$ and $h(\hat{\theta})$ as small as possible. As
with GMM estimation, we use a weighted Euclidean measure of distance. While a
CMD estimator can be defined for any p.s.d. weighting matrix, we consider only the
efficient CMD estimator given our choice of $\hat{\pi}$. As with efficient GMM, the CMD
estimator that uses the efficient weighting matrix is also called the minimum
chi-square estimator.

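Assuming that for an $S \times S$ p.s.d. matrix $\Xi_o$,

$$\sqrt{N}(\hat{\pi} - \pi_o) \overset{a}{\sim} \mathrm{Normal}(0, \Xi_o), \tag{14.63}$$
it turns out that an efficient CMD estimator solves

$$\min_{\theta \in \Theta} \{\hat{\pi} - h(\theta)\}' \hat{\Xi}^{-1} \{\hat{\pi} - h(\theta)\}, \tag{14.64}$$
where $\mathrm{plim}_{N \to \infty} \hat{\Xi} = \Xi_o$. In other words, an efficient weighting matrix is the inverse
of any consistent estimator of $\mathrm{Avar}\, \sqrt{N}(\hat{\pi} - \pi_o)$.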

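We can easily derive the asymptotic variance of $\sqrt{N}(\hat{\theta} - \theta_o)$. The first-order
condition for $\hat{\theta}$ is
$$H(\hat{\theta})' \hat{\Xi}^{-1} \{\hat{\pi} - h(\hat{\theta})\} = 0, \tag{14.65}$$
where $H(\theta) \equiv \nabla_\theta h(\theta)$ is the $S \times P$ Jacobian of $h(\theta)$. Since $h(\theta_o) = \pi_o$ and
$$\sqrt{N}\{h(\hat{\theta}) - h(\theta_o)\} = H(\theta_o) \sqrt{N}(\hat{\theta} - \theta_o) + o_p(1),$$
by a standard mean value expansion about $\theta_o$, we have
$$0 = H(\hat{\theta})' \hat{\Xi}^{-1} \{\sqrt{N}(\hat{\pi} - \pi_o) - H(\theta_o) \sqrt{N}(\hat{\theta} - \theta_o)\} + o_p(1). \tag{14.66}$$

Because $H(\cdot)$ is continuous and $\hat{\theta} \overset{p}{\to} \theta_o$, $H(\hat{\theta}) = H(\theta_o) + o_p(1)$; by assumption
$\hat{\Xi} = \Xi_o + o_p(1)$. Therefore,

$$H(\theta_o)' \Xi_o^{-1} H(\theta_o) \sqrt{N}(\hat{\theta} - \theta_o) = H(\theta_o)' \Xi_o^{-1} \sqrt{N}(\hat{\pi} - \pi_o) + o_p(1).$$

By assumption (14.63) and the asymptotic equivalence lemma,
$$H(\theta_o)' \Xi_o^{-1} H(\theta_o) \sqrt{N}(\hat{\theta} - \theta_o) \overset{a}{\sim} \mathrm{Normal}[0, H(\theta_o)' \Xi_o^{-1} H(\theta_o)],$$

and so

$$\sqrt{N}(\hat{\theta} - \theta_o) \overset{a}{\sim} \mathrm{Normal}[0, (H_o' \Xi_o^{-1} H_o)^{-1}], \tag{14.67}$$

provided that $H_o \equiv H(\theta_o)$ has full-column rank $P$, as will generally be the case when
$\theta_o$ is identified and $h(\cdot)$ contains no redundancies. The appropriate estimator of
$\mathrm{Avar}(\hat{\theta})$ is

$$\widehat{\mathrm{Avar}}(\hat{\theta}) \equiv (\hat{H}' \hat{\Xi}^{-1} \hat{H})^{-1} / N = (\hat{H}' [\widehat{\mathrm{Avar}}(\hat{\pi})]^{-1} \hat{H})^{-1}. \tag{14.68}$$

The proof that $\hat{\Xi}^{-1}$ is the optimal weighting matrix in expression (14.64) is very
similar to the derivation of the optimal weighting matrix for GMM. (It can also
be shown by applying Theorem 14.3.) We will simply call the efficient estimator
the CMD estimator, where it is understood that we are using the efficient weighting
matrix.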

There is another efficiency issue that arises when more than one $\sqrt{N}$-asymptotically
normal estimator for $\pi_o$ is available: Which estimator of $\pi_o$ should be used? Let $\hat{\theta}$ be
the estimator based on $\hat{\pi}$, and let $\tilde{\theta}$ be the estimator based on another estimator, $\tilde{\pi}$.
You are asked to show in Problem 14.6 that $\mathrm{Avar}\, \sqrt{N}(\tilde{\theta} - \theta_o) - \mathrm{Avar}\, \sqrt{N}(\hat{\theta} - \theta_o)$
is p.s.d. whenever $\mathrm{Avar}\, \sqrt{N}(\tilde{\pi} - \pi_o) - \mathrm{Avar}\, \sqrt{N}(\hat{\pi} - \pi_o)$ is p.s.d. In other words,
we should use the most efficient estimator of $\pi_o$ to obtain the most efficient estimator
of $\theta_o$.

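A test of overidentifying restrictions is immediately available after estimation, because, under the null hypothesis $\pi_o = h(\theta_o)$,

$$N[\hat{\pi} - h(\hat{\theta})]' \hat{\Xi}^{-1} [\hat{\pi} - h(\hat{\theta})] \overset{a}{\sim} \chi^2_{S-P}. \tag{14.69}$$

To show this result, we use

$$\begin{aligned}
\sqrt{N}[\hat{\pi} - h(\hat{\theta})] &= \sqrt{N}(\hat{\pi} - \pi_o) - H_o \sqrt{N}(\hat{\theta} - \theta_o) + o_p(1) \\
&= \sqrt{N}(\hat{\pi} - \pi_o) - H_o (H_o' \Xi_o^{-1} H_o)^{-1} H_o' \Xi_o^{-1} \sqrt{N}(\hat{\pi} - \pi_o) + o_p(1) \\
&= [I_S - H_o (H_o' \Xi_o^{-1} H_o)^{-1} H_o' \Xi_o^{-1}] \sqrt{N}(\hat{\pi} - \pi_o) + o_p(1).
\end{aligned}$$

Therefore, up to $o_p(1)$,

$$\Xi_o^{-1/2} \sqrt{N}\{\hat{\pi} - h(\hat{\theta})\} = [I_S - \Xi_o^{-1/2} H_o (H_o' \Xi_o^{-1} H_o)^{-1} H_o' \Xi_o^{-1/2}] Z \equiv M_o Z,$$

where $Z \equiv \Xi_o^{-1/2} \sqrt{N}(\hat{\pi} - \pi_o) \overset{d}{\to} \mathrm{Normal}(0, I_S)$. But $M_o$ is a symmetric idempotent matrix with rank $S - P$, so $\{\sqrt{N}[\hat{\pi} - h(\hat{\theta})]\}' \Xi_o^{-1} \{\sqrt{N}[\hat{\pi} - h(\hat{\theta})]\} \overset{a}{\sim} \chi^2_{S-P}$. Because $\hat{\Xi}$ is consistent for $\Xi_o$, expression (14.69) follows from the asymptotic equivalence lemma. The statistic can also be expressed as

$$\{\hat{\pi} - h(\hat{\theta})\}' [\widehat{\mathrm{Avar}}(\hat{\pi})]^{-1} \{\hat{\pi} - h(\hat{\theta})\}. \tag{14.70}$$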
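As a numerical illustration of the estimator, its variance estimator (14.68), and the overidentification statistic (14.70), the following minimal sketch treats the special case where the mapping is linear, $h(\theta) = H\theta$, so that the minimization has the closed form given in equation (14.89). The function name `cmd_linear` and the toy inputs are assumptions made for illustration only; they are not part of the text.

```python
import numpy as np

def cmd_linear(pi_hat, avar_pi, H):
    """Efficient CMD for a linear mapping pi = H @ theta (illustrative sketch).

    pi_hat  : (S,) unrestricted estimate of pi
    avar_pi : (S, S) estimate of Avar(pi_hat), i.e. Xi_hat / N
    H       : (S, P) known matrix with full column rank
    """
    W = np.linalg.inv(avar_pi)                  # efficient weighting matrix
    A = H.T @ W @ H
    theta_hat = np.linalg.solve(A, H.T @ W @ pi_hat)
    avar_theta = np.linalg.inv(A)               # variance estimator, as in (14.68)
    resid = pi_hat - H @ theta_hat
    overid = float(resid @ W @ resid)           # statistic in the form (14.70)
    df = H.shape[0] - H.shape[1]                # S - P overidentifying restrictions
    return theta_hat, avar_theta, overid, df

# Toy inputs (hypothetical): two unrestricted estimates of one common parameter.
H = np.array([[1.0], [1.0]])
pi_hat = np.array([1.0, 3.0])
theta_hat, avar_theta, overid, df = cmd_linear(pi_hat, np.eye(2), H)
```

With equal precision on the two elements of `pi_hat`, the CMD estimate is simply their average, and the statistic measures how far each element lies from that common value. For a nonlinear $h(\cdot)$ the same objective would have to be minimized numerically.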


Testing restrictions on $\theta_o$ is also straightforward, assuming that we can express the restrictions as $\theta_o = d(\alpha_o)$ for an $R \times 1$ vector $\alpha_o$, $R < P$. Under these restrictions, $\pi_o = h[d(\alpha_o)] \equiv g(\alpha_o)$. Thus, $\alpha_o$ can be estimated by minimum distance by solving problem (14.64) with $\alpha$ in place of $\theta$ and $g(\alpha)$ in place of $h(\theta)$. The same estimator $\hat{\Xi}$ should be used in both minimization problems. Then it can be shown (under interiority and differentiability) that

$$N[\hat{\pi} - g(\hat{\alpha})]' \hat{\Xi}^{-1} [\hat{\pi} - g(\hat{\alpha})] - N[\hat{\pi} - h(\hat{\theta})]' \hat{\Xi}^{-1} [\hat{\pi} - h(\hat{\theta})] \overset{a}{\sim} \chi^2_{P-R}, \tag{14.71}$$

when the restrictions on $\theta_o$ are true.


14.6 Panel Data Applications 


We now cover several panel data applications of GMM and CMD. The models in 
Sections 14.6.1 and 14.6.3 are nonlinear in parameters. 


14.6.1 Nonlinear Dynamic Models 


One increasingly popular use of panel data is to test rationality in economic models 
of individual, family, or firm behavior (see, for example, Shapiro, 1984; Zeldes, 1989; 
Keane and Runkle, 1992; Shea, 1995). For a random draw from the population we 
assume that T time periods are available. Suppose that an economic theory implies 
that 


$$E[r_t(w_t, \theta_o) \mid w_{t-1}, \ldots, w_1] = 0, \qquad t = 1, \ldots, T, \tag{14.72}$$

where, for simplicity, $r_t$ is a scalar. These conditional moment restrictions are often implied by rational expectations, under the assumption that the decision horizon is the same length as the sampling period. For example, consider a standard life-cycle model of consumption. Let $c_{it}$ denote consumption of family $i$ at time $t$, let $h_{it}$ denote taste shifters, let $\delta_o$ denote the common rate of time preference, and let $a_{it}^j$ denote the return for family $i$ from holding asset $j$ from period $t - 1$ to $t$. Under the assumption that utility is given by

$$u(c_{it}, \theta_{it}) = \exp(h_{it}\beta_o)\, c_{it}^{1-\lambda_o}/(1 - \lambda_o), \tag{14.73}$$

the Euler equation is

$$E[(1 + a_{it}^j)(c_{it}/c_{i,t-1})^{-\lambda_o} \mid \mathcal{I}_{i,t-1}] = (1 + \delta_o)^{-1} \exp(x_{it}\beta_o), \tag{14.74}$$

where $\mathcal{I}_{it}$ is family $i$'s information set at time $t$ and $x_{it} \equiv h_{i,t-1} - h_{it}$; equation (14.74) assumes that $h_{it} - h_{i,t-1} \in \mathcal{I}_{i,t-1}$, an assumption which is often reasonable. Given equation (14.74), we can define a residual function for each $t$:

$$r_{it}(\theta) = (1 + a_{it}^j)(c_{it}/c_{i,t-1})^{-\lambda} - \exp(x_{it}\beta), \tag{14.75}$$

where $(1 + \delta)^{-1}$ is absorbed in an intercept in $x_{it}$. Let $w_{it}$ contain $c_{it}$, $c_{i,t-1}$, $a_{it}$, and $x_{it}$. Then condition (14.72) holds, and $\lambda_o$ and $\beta_o$ can be estimated by GMM.

Returning to condition (14.72), valid instruments at time $t$ are functions of information known at time $t - 1$:

$$z_t = f_t(w_{t-1}, \ldots, w_1). \tag{14.76}$$

The $T \times 1$ residual vector is $r(w, \theta) = [r_1(w_1, \theta), \ldots, r_T(w_T, \theta)]'$, and the matrix of instruments has the same form as matrix (14.38) for each $i$ (with $G = T$). Then, the minimum chi-square estimator can be obtained after using the system 2SLS estimator, although the choice of instruments is a nontrivial matter. A common choice is linear and quadratic functions of variables lagged one or two time periods.

Estimation of the optimal weighting matrix is somewhat simplified under the conditional moment restrictions (14.72). Recall from Section 14.2 that the optimal estimator uses the inverse of a consistent estimator of $\Lambda_o = E[Z_i' r_i(\theta_o) r_i(\theta_o)' Z_i]$. Under condition (14.72), this matrix is block diagonal. Dropping the $i$ subscript, the $(s, t)$ block is $E[r_s(\theta_o) r_t(\theta_o) z_s' z_t]$. For concreteness, assume that $s < t$. Then $z_t$, $z_s$, and $r_s(\theta_o)$ are all functions of $w_{t-1}, w_{t-2}, \ldots, w_1$. By iterated expectations it follows that

$$E[r_s(\theta_o) r_t(\theta_o) z_s' z_t] = E\{r_s(\theta_o) z_s' z_t E[r_t(\theta_o) \mid w_{t-1}, \ldots, w_1]\} = 0,$$

and so we only need to estimate the diagonal blocks of $E[Z_i' r_i(\theta_o) r_i(\theta_o)' Z_i]$:

$$N^{-1} \sum_{i=1}^{N} \tilde{r}_{it}^2 z_{it}' z_{it} \tag{14.77}$$

is a consistent estimator of the $t$th block, where the $\tilde{r}_{it}$ are obtained from an inefficient GMM estimator.
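The block-diagonal structure can be made concrete with a short sketch. It assumes the first-stage residuals are stored in an $N \times T$ array and the period-$t$ instruments in an $N \times L_t$ array; the function name `block_diag_lambda` and this storage layout are illustrative assumptions, not notation from the text.

```python
import numpy as np

def block_diag_lambda(resid, instruments):
    """Block-diagonal moment variance estimate with t-th block as in (14.77).

    resid       : (N, T) residuals from an initial (inefficient) GMM estimator
    instruments : list of T arrays; element t has shape (N, L_t)
    """
    N, T = resid.shape
    # t-th diagonal block: N^{-1} sum_i r_it^2 z_it' z_it
    blocks = [(Z * resid[:, [t]] ** 2).T @ Z / N
              for t, Z in enumerate(instruments)]
    L = sum(b.shape[0] for b in blocks)
    lam = np.zeros((L, L))
    pos = 0
    for b in blocks:
        k = b.shape[0]
        lam[pos:pos + k, pos:pos + k] = b   # off-diagonal blocks stay zero
        pos += k
    return lam

# Hypothetical toy data: N = 2 units, T = 2 periods, one instrument per period.
resid = np.array([[1.0, 2.0], [3.0, 4.0]])
instruments = [np.ones((2, 1)), np.ones((2, 1))]
lam_hat = block_diag_lambda(resid, instruments)
```

The optimal weighting matrix would then be the inverse of `lam_hat`. Setting the off-diagonal blocks to zero is valid only under condition (14.72); otherwise the full matrix must be estimated.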

In cases where the data frequency does not match the horizon relevant for decision 
making, the optimal matrix does not have the block diagonal form: some off-diagonal 
blocks will be nonzero. See Hansen (1982) for the pure time series case. 

The previous model does not allow for unobserved heterogeneity, a feature that 
can be important in rational expectations types of applications. For example, in ad- 
dition to having unobserved tastes in the utility function, individuals, families, and 
firms may have different discount rates. Wooldridge (1997a) shows how to obtain 
orthogonality conditions when the previous framework allows for unobserved heter- 
ogeneity in particular ways. GMM can be applied to the resulting nonlinear moment 
conditions. 

There are many other kinds of dynamic panel data models that result in moment 
conditions that are nonlinear in parameters, even if the underlying model is linear. 
For example, for the linear, unobserved effects AR(1) model, Ahn and Schmidt 
(1995) add to the moment conditions used in the Arellano and Bond (1991) proce- 
dure that we covered in Section 11.6.2. Some of these moment conditions are non- 
linear in the parameters and can be exploited using nonlinear GMM. 

Recently, Wooldridge (2009b) showed how to implement parametric versions of 
Olley and Pakes (1996) and Levinsohn and Petrin (2003) for estimating firm-level 
production functions with panel data. The general approach is to replace unobserved 
productivity with functions of state variables and proxy variables. When the un- 
known functions are approximated using polynomials, Wooldridge (2009b) shows 
that the estimation problem can be structured as two equations (for each time period) 
with different instruments available for the different equations. 


14.6.2 Minimum Distance Approach to the Unobserved Effects Model 


In Section 11.1.2 we discussed Chamberlain’s (1982, 1984) approach to estimating the unobserved effects model $y_{it} = x_{it}\beta + c_i + u_{it}$ (where we do not index the true value by “o” in this subsection). Rather than eliminate $c_i$ from the equation, Chamberlain replaced it with a linear projection on the entire history of the explanatory variables. In Section 11.1.2, we showed how to estimate the parameters in a GMM framework. It is also useful to see Chamberlain’s original suggestion, which was to estimate the parameters using CMD estimation.

Recall that the key equations, after we substitute in the linear projection, are

$$y_{it} = \psi + x_{i1}\lambda_1 + \cdots + x_{it}(\beta + \lambda_t) + \cdots + x_{iT}\lambda_T + v_{it}, \tag{14.78}$$

where

$$E(v_{it}) = 0, \qquad E(x_i' v_{it}) = 0, \qquad t = 1, 2, \ldots, T. \tag{14.79}$$

(For notational simplicity we do not index the true parameters by “o”.) Equation (14.78) embodies the restrictions on the “structural” parameters $\theta \equiv (\psi, \lambda_1', \ldots, \lambda_T', \beta')'$, a $(1 + TK + K) \times 1$ vector. To apply CMD, write

$$y_{it} = \pi_{t0} + x_i \pi_t + v_{it}, \qquad t = 1, \ldots, T,$$

so that the vector $\pi$ is $T(1 + TK) \times 1$. When we impose the restrictions,

$$\pi_{t0} = \psi, \qquad \pi_t = [\lambda_1', \lambda_2', \ldots, (\beta + \lambda_t)', \ldots, \lambda_T']', \qquad t = 1, \ldots, T.$$

Therefore, we can write $\pi = H\theta$ for a $(T + T^2K) \times (1 + TK + K)$ matrix $H$. When $T = 2$, $\pi$ can be written with restrictions imposed as $\pi = (\psi, \beta' + \lambda_1', \lambda_2', \psi, \lambda_1', \beta' + \lambda_2')'$, and so

$$H = \begin{bmatrix}
1 & 0 & 0 & 0 \\
0 & I_K & 0 & I_K \\
0 & 0 & I_K & 0 \\
1 & 0 & 0 & 0 \\
0 & I_K & 0 & 0 \\
0 & 0 & I_K & I_K
\end{bmatrix}.$$

The CMD estimator can be obtained in closed form, once we have $\hat{\pi}$; see Problem 14.7 for the general case.
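The pattern of the restriction matrix for $T = 2$ extends mechanically to any $T$ and $K$. The sketch below builds it under the ordering used above, with $\theta = (\psi, \lambda_1', \ldots, \lambda_T', \beta')'$ and $\pi$ stacked period by period as $(\pi_{t0}, \pi_t')$; the helper name `chamberlain_H` is a hypothetical label introduced here for illustration.

```python
import numpy as np

def chamberlain_H(T, K):
    """Restriction matrix H with pi = H @ theta in Chamberlain's setup (sketch).

    Columns: psi (1), lambda_1..lambda_T (T*K), beta (K).
    Rows:    for each period t, the intercept pi_t0 followed by pi_t (T*K).
    """
    S, P = T * (1 + T * K), 1 + T * K + K
    H = np.zeros((S, P))
    I = np.eye(K)
    for t in range(T):
        row0 = t * (1 + T * K)
        H[row0, 0] = 1.0                                  # pi_t0 = psi
        for s in range(T):
            r = row0 + 1 + s * K
            H[r:r + K, 1 + s * K:1 + (s + 1) * K] = I     # lambda_s enters every period
            if s == t:
                H[r:r + K, 1 + T * K:] = I                # beta enters only when s == t
    return H

H2 = chamberlain_H(T=2, K=1)   # reproduces the T = 2 pattern with I_K = 1
```

The number of overidentifying restrictions is the row count minus the column count of the result, which grows quickly with $T$.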

How should we obtain $\hat{\pi}$, the vector of estimates without the restrictions imposed? There is really only one way, and that is OLS for each time period. Condition (14.79) ensures that OLS is consistent and $\sqrt{N}$-asymptotically normal. Why not use a system method, in particular, SUR? For one thing, we cannot generally assume that $v_i$ satisfies the requisite homoskedasticity assumption that ensures that SUR is more efficient than OLS equation by equation; see Section 11.1.2. Anyway, because the same regressors appear in each equation and no restrictions are imposed on the $\pi_t$, OLS and SUR are identical. Procedures that might use nonlinear functions of $x_i$ as instruments are not allowed under condition (14.79).

The estimator $\hat{\Xi}$ of Avar $\sqrt{N}(\hat{\pi} - \pi)$ is the robust asymptotic variance for system OLS from Chapter 7:

$$\hat{\Xi} \equiv \left(N^{-1}\sum_{i=1}^{N} X_i' X_i\right)^{-1} \left(N^{-1}\sum_{i=1}^{N} X_i' \hat{v}_i \hat{v}_i' X_i\right) \left(N^{-1}\sum_{i=1}^{N} X_i' X_i\right)^{-1}, \tag{14.80}$$

where $X_i = I_T \otimes (1, x_i)$ is $T \times (T + T^2K)$ and $\hat{v}_i$ is the vector of OLS residuals; see also equation (7.26).
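A minimal sketch of this first stage follows: OLS period by period, then the robust variance in (14.80). It assumes a balanced panel with `y` stored as an $N \times T$ array and the full history $x_i$ stored as a row of an $N \times TK$ array; the function name `unrestricted_pi` and the simulated inputs are illustrative assumptions.

```python
import numpy as np

def unrestricted_pi(y, x):
    """Period-by-period OLS of y_it on (1, x_i) and the robust Xi_hat of (14.80).

    y : (N, T) outcomes; x : (N, T*K) full history of covariates per unit.
    Returns pi_hat of length T*(1 + T*K), stacked by period, and Xi_hat.
    """
    N, T = y.shape
    W = np.column_stack([np.ones(N), x])            # rows are (1, x_i)
    coef, *_ = np.linalg.lstsq(W, y, rcond=None)    # column t holds (pi_t0, pi_t)
    V = y - W @ coef                                # OLS residuals v_hat_it
    pi_hat = coef.T.ravel()
    # X_i' v_hat_i stacks v_hat_it * (1, x_i)' over t, since X_i = I_T kron (1, x_i)
    scores = (V[:, :, None] * W[:, None, :]).reshape(N, -1)
    A_inv = np.kron(np.eye(T), np.linalg.inv(W.T @ W / N))
    B = scores.T @ scores / N
    return pi_hat, A_inv @ B @ A_inv

# Simulated inputs, for illustration only.
rng = np.random.default_rng(0)
x_demo = rng.normal(size=(200, 2))                  # T = 2, K = 1
y_demo = x_demo @ np.array([[1.0, 0.2], [0.3, 1.0]]) + rng.normal(size=(200, 2))
pi_hat, Xi_hat = unrestricted_pi(y_demo, x_demo)
```

Dividing `Xi_hat` by $N$ gives the estimate of Avar($\hat{\pi}$) that enters the minimum distance objective and the statistic (14.70).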

Given the linear model with an additive unobserved effect, the overidentification test statistic (14.69) in Chamberlain’s setup is a test of the strict exogeneity assumption. Essentially, it is a test of whether the leads and lags of $x_t$ appearing in each time period are due to a time-constant unobserved effect $c_i$. The number of overidentifying restrictions is $(T + T^2K) - (1 + TK + K)$. Perhaps not surprisingly, the minimum distance approach to estimating $\theta$ is asymptotically equivalent to the GMM procedure we described in Section 11.1.2, as can be reasoned from the work of Angrist and Newey (1991).

One hypothesis of interest concerning $\theta$ is that $\lambda_t = 0$, $t = 1, \ldots, T$. Under this hypothesis, the random effects assumption that the unobserved effect $c_i$ is uncorrelated with $x_{it}$ for all $t$ holds. We discussed a test of this assumption in Chapter 10. A more general test is available in the minimum distance setting. First, estimate $\alpha \equiv (\psi, \beta')'$ by minimum distance, using $\hat{\pi}$ and $\hat{\Xi}$ in equation (14.80). Second, compute the test statistic (14.71). Chamberlain (1984) gives an empirical example.

Minimum distance methods can be applied to more complicated panel data models, 
including some of the duration models that we cover in Chapter 22. (See Han and 
Hausman, 1990.) Van der Klaauw (1996) uses minimum distance estimation in a 
complicated dynamic model of labor force participation and marital status. 


14.6.3 Models with Time-Varying Coefficients on the Unobserved Effects 


Now we extend the usual linear model to allow the unobserved heterogeneity to have time-varying coefficients:

$$y_{it} = x_{it}\beta + \eta_t c_i + u_{it}, \qquad t = 1, \ldots, T, \tag{14.81}$$

where, with small $T$ and large $N$, it makes sense to treat $\{\eta_t : t = 1, \ldots, T\}$ as parameters, like $\beta$. We still view $c_i$ as a random draw that comes with the observed variables for unit $i$. This model has many applications. Labor economists sometimes think the return to unobserved “talent” might change over time. Those who estimate, say, firm-level production functions like to allow the importance of unobserved factors, such as managerial skill, to change over time.

Because $c_i$ is unobserved, we cannot identify $T$ separate coefficients in (14.81). A convenient normalization is to set the coefficient for $t = 1$ to unity: $\eta_1 = 1$, and we will use this in what follows. Thus, if we seek to estimate the time-varying coefficients (sometimes called the factor loads), then we only estimate $\eta_2, \ldots, \eta_T$. Unless otherwise stated, we make the strict exogeneity assumption conditional on $c_i$:

$$E(u_{it} \mid x_{i1}, x_{i2}, \ldots, x_{iT}, c_i) = 0, \qquad t = 1, \ldots, T, \tag{14.82}$$

an assumption we used frequently in Chapter 10 for random effects, fixed effects, and first differencing estimation.

Before we discuss estimation of $\beta$ along with the $\eta_t$, we can ask how estimators that ignore the time-varying coefficients fare in estimating $\beta$. After all, in some applications we might be primarily interested in $\beta$, but we are concerned that ignoring the time-varying loads causes inconsistency of traditional methods. If we take a random effects approach and assume that $c_i$ is uncorrelated with $x_{it}$ for all $t$, what happens if we just apply the usual RE estimator from Chapter 10? Let $\mu_c = E(c_i)$ and write (14.81) as

$$y_{it} = \alpha_t + x_{it}\beta + \eta_t d_i + u_{it}, \qquad t = 1, \ldots, T, \tag{14.83}$$

where $\alpha_t = \eta_t \mu_c$ and $d_i = c_i - \mu_c$ has zero mean. In addition, the composite error, $v_{it} \equiv \eta_t d_i + u_{it}$, is uncorrelated with $(x_{i1}, x_{i2}, \ldots, x_{iT})$ (as well as having a zero mean). Of course, $v_{it}$ generally has a time-varying variance as well as serial correlation that is not of the standard RE form found in Assumption RE.3 from Chapter 10. Nevertheless, as we learned there and in Chapter 7, feasible GLS applied to a linear panel data equation is consistent provided $v_{it}$, the error term, is uncorrelated with $x_{is}$, the explanatory variables, across all time periods. The bottom line is, applying the usual RE estimator to (14.83) produces a consistent estimator of $\beta$ even though we have ignored the $\eta_t$. Of course, the usual RE variance matrix estimator is inconsistent, so robust inference is needed, but that is straightforward. Further, the usual RE estimator is inefficient relative to the FGLS estimator that does not restrict the variance matrix of the composite error. If we made the standard homoskedasticity and serial independence assumptions on $\{u_{it}\}$ (conditional on $x_i$, as usual), we could derive the variance matrix of the $T \times 1$ vector $v_i$ and obtain the appropriate, but restricted, variance matrix. Incidentally, in saying we estimate (14.83) by RE, it is imperative that we include a full set of year intercepts (that is, an overall constant and then $T - 1$ time dummies) to account for the first term. Here we find yet another reason to allow for a different intercept in each time period.

We can also evaluate the usual fixed effects (FE) estimator when we allow correlation between $c_i$ and $x_i$. The standard FE estimator is consistent provided $\ddot{x}_{it}$ is uncorrelated with $v_{it}$. But $u_{it}$ is uncorrelated with $\ddot{x}_{it}$ by (14.82), and so the key condition is

$$\mathrm{Cov}(\ddot{x}_{it}, c_i) = 0, \qquad t = 1, \ldots, T. \tag{14.84}$$

In other words, the unobserved effect is uncorrelated with the deviations $\ddot{x}_{it} = x_{it} - \bar{x}_i$. We encountered a similar condition in Section 11.7.3 when we studied the assumptions under which the usual FE estimator consistently estimates the population average slope in a model with random slopes.

Because (14.84) allows for correlation between $\bar{x}_i$ and $c_i$, we can conclude that the usual FE estimator has some robustness to the presence of $\eta_t$ for estimating $\beta$. Further, if we use the extended FE estimators for random trend models (see Section 11.7.3), then we can replace $\ddot{x}_{it}$ with detrended covariates. Then, $c_i$ can be correlated with underlying levels and trends in $x_{it}$ (provided we have a sufficient number of time periods).

Using the usual RE or FE estimators (with full time period dummies) does not allow us to estimate the $\eta_t$, or even determine whether the $\eta_t$ change over time. Even if we are interested only in $\beta$ when $c_i$ and $x_i$ are allowed to be correlated, being able to detect time-varying factor loads is important because (14.84) is not completely general. It would be very useful to have a simple test of $H_0: \eta_2 = \eta_3 = \cdots = \eta_T = 1$ with some power against the alternative of time-varying coefficients. Then we can determine whether a more sophisticated estimation method might be needed.

We can obtain a simple variable addition test that can be computed using linear estimation methods if we specify a particular relationship between $c_i$ and $\bar{x}_i$. We use the Mundlak (1978) assumption that we introduced in Chapter 10:

$$c_i = \psi + \bar{x}_i \xi + a_i. \tag{14.85}$$

Then

$$y_{it} = \eta_t \psi + x_{it}\beta + \eta_t \bar{x}_i \xi + \eta_t a_i + u_{it} = \alpha_t + x_{it}\beta + \bar{x}_i \xi + \lambda_t \bar{x}_i \xi + a_i + \lambda_t a_i + u_{it}, \tag{14.86}$$

where $\lambda_t = \eta_t - 1$ for all $t$. Under the null hypothesis, $\lambda_t = 0$, $t = 2, \ldots, T$. Note that if we impose the null hypothesis, the resulting model is linear, and we can estimate it by pooled OLS of $y_{it}$ on $1, d2_t, \ldots, dT_t, x_{it}, \bar{x}_i$ across $t$ and $i$, where the $dr_t$ are time dummies. As we discussed in Section 10.7.2, the estimate of $\beta$ is the usual FE estimator. More important for our current purposes, we obtain an estimate of $\xi$, which we call $\hat{\xi}$. The following variable addition test can be derived from the score principle in the context of nonlinear regression; we can use an informal derivation directly from equation (14.86). If $\lambda_t$ is different from zero, then the coefficient on $\bar{x}_i \xi$ at time $t$ is different from unity, the coefficient on $\bar{x}_i \xi$ at $t = 1$. Therefore, we add the interaction terms $dr_t(\bar{x}_i \hat{\xi})$ for $r = 2, \ldots, T$ to the FE estimation. That is, we construct the auxiliary equation

$$y_{it} = \alpha_1 + \alpha_2 d2_t + \cdots + \alpha_T dT_t + x_{it}\beta + \lambda_2 d2_t(\bar{x}_i \hat{\xi}) + \cdots + \lambda_T dT_t(\bar{x}_i \hat{\xi}) + error_{it}, \tag{14.87}$$

estimate this equation by standard FE, and test the joint significance of the $T - 1$ terms $d2_t(\bar{x}_i \hat{\xi}), \ldots, dT_t(\bar{x}_i \hat{\xi})$. (The term $\bar{x}_i \hat{\xi}$ would drop out of an FE estimation, so we just omit it.) Note that $\bar{x}_i \hat{\xi}$ is a scalar, and so the test has $T - 1$ degrees of freedom. As always, it is prudent to use a fully robust test (even though, under the null, $\lambda_t a_i$ disappears from the error term).
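The steps of the variable addition test can be sketched in a few lines for a balanced panel. This is only an illustration under stated assumptions: the data are stored as `y` ($N \times T$) and `x` ($N \times T \times K$), the function name `loads_test` is hypothetical, and the robust Wald statistic uses a simple cluster-by-unit sandwich variance that treats the estimated index as a fixed regressor.

```python
import numpy as np

def loads_test(y, x):
    """Variable addition test for time-varying loads, following (14.87) (sketch).

    y : (N, T) outcomes; x : (N, T, K) covariates; balanced panel.
    Returns a cluster-robust Wald statistic and its degrees of freedom, T - 1.
    """
    N, T, K = x.shape
    xbar = x.mean(axis=1)                                     # time averages, (N, K)
    D = np.tile(np.eye(T)[None, :, 1:], (N, 1, 1))            # dummies d2_t, ..., dT_t
    # Step 1: pooled OLS of y_it on 1, time dummies, x_it, xbar_i gives xi_hat.
    X1 = np.concatenate([np.ones((N, T, 1)), D, x,
                         np.repeat(xbar[:, None, :], T, axis=1)], axis=2)
    b1, *_ = np.linalg.lstsq(X1.reshape(N * T, -1), y.reshape(N * T), rcond=None)
    index = xbar @ b1[-K:]                                    # scalar xbar_i @ xi_hat
    # Step 2: FE (within) regression with the interactions dr_t * index added.
    X2 = np.concatenate([D, x, D * index[:, None, None]], axis=2)
    Xw = X2 - X2.mean(axis=1, keepdims=True)
    yw = y - y.mean(axis=1, keepdims=True)
    A = np.einsum('itj,itk->jk', Xw, Xw)
    b2 = np.linalg.solve(A, np.einsum('itj,it->j', Xw, yw))
    u = yw - Xw @ b2
    s = np.einsum('itj,it->ij', Xw, u)                        # per-unit score sums
    V = np.linalg.solve(A, s.T @ s) @ np.linalg.inv(A)        # robust sandwich
    lam, V_lam = b2[-(T - 1):], V[-(T - 1):, -(T - 1):]
    return float(lam @ np.linalg.solve(V_lam, lam)), T - 1
```

The returned statistic would be compared with a chi-square distribution with $T - 1$ degrees of freedom. The time averages of the regressors are omitted from the second step because they drop out of the within transformation.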

A few comments about this test are in order. First, although we used the Mundlak device (14.85) to obtain the test, it does not have to represent the actual linear projection because we are simply adding terms to an FE estimation. Under the null, we do not need to restrict the relationship between $c_i$ and $x_i$. Of course, the power of the test may be affected by this choice. Second, the test only makes sense if $\xi \neq 0$; in particular, it cannot be used in an RE environment. Third, a rejection of the null does not necessarily mean that the usual FE estimator is inconsistent for $\beta$: assumption (14.84) could still hold. In fact, the change in the estimate of $\beta$ when the interaction terms are added can be indicative of whether accounting for time-varying $\eta_t$ is likely to be important.

Unfortunately, when we reject constant $\eta_t$ (that is, $\lambda_t \neq 0$) we cannot simply use the estimate of $\beta$ from (14.87), even under (14.85). The reason is that $\hat{\xi}$ has been estimated under the null, and so $\hat{\xi}$ would not be generally consistent for $\xi$ under the alternative. If we feel comfortable imposing (14.85), then we could estimate (14.86) using minimum distance estimation. The first-stage estimation is linear and estimates, along with separate intercepts and $\beta$, $T$ separate coefficient vectors, say $\xi_t$, on the time average $\bar{x}_i$. After pooled OLS estimation, we can impose the restrictions $\xi_t = \eta_t \xi$, $t = 1, \ldots, T$. We can also apply GMM. This becomes a nonlinear GMM problem because the equation is nonlinear in the parameters $\lambda_t$ and $\xi$, but the GMM theory we developed in Section 14.1 can be used. If we do not want to use overidentifying restrictions (that would come from the assumption that $x_{ir}$ is uncorrelated with the error $(1 + \lambda_t)a_i + u_{it}$ for all $r$ and $t$), we could simply estimate (14.86) by NLS. Alternatively, we could use the Chamberlain device and replace $\bar{x}_i \xi$ with $x_i \zeta$, where $\zeta$ is $TK \times 1$. This would impose no restrictions on the relationship between $c_i$ and $x_i$.

Typically, when we want to allow arbitrary correlation between $c_i$ and $x_i$, we work directly from (14.81) and eliminate the $c_i$. There are several ways to do this. If we maintain that all $\eta_t$ are different from zero, then we can use a quasi-differencing method to eliminate $c_i$. In particular, for $t \geq 2$ we can multiply the $t - 1$ equation by $\eta_t/\eta_{t-1}$ and subtract the result from the time $t$ equation:




$$\begin{aligned}
y_{it} - (\eta_t/\eta_{t-1}) y_{i,t-1} &= [x_{it} - (\eta_t/\eta_{t-1}) x_{i,t-1}]\beta + [\eta_t c_i - (\eta_t/\eta_{t-1}) \eta_{t-1} c_i] + [u_{it} - (\eta_t/\eta_{t-1}) u_{i,t-1}] \\
&= [x_{it} - (\eta_t/\eta_{t-1}) x_{i,t-1}]\beta + [u_{it} - (\eta_t/\eta_{t-1}) u_{i,t-1}], \qquad t \geq 2.
\end{aligned}$$

We define $\theta_t = \eta_t/\eta_{t-1}$ and write

$$y_{it} - \theta_t y_{i,t-1} = (x_{it} - \theta_t x_{i,t-1})\beta + e_{it}, \qquad t = 2, \ldots, T, \tag{14.88}$$

where $e_{it} \equiv u_{it} - \theta_t u_{i,t-1}$. Under the strict exogeneity assumption, $e_{it}$ is uncorrelated with every element of $x_i$, and so we can apply GMM to (14.88) to estimate $\beta$ and $(\theta_2, \ldots, \theta_T)$. Again, this requires using nonlinear GMM methods, and the $e_{it}$ would typically be serially correlated. If we do not impose restrictions on the second moment matrix of $u_i$, then we would not use any information on the second moments of $e_i$; we would (eventually) use an unrestricted weighting matrix after an initial estimation.

Using all of $x_i$ in each time period can result in too many overidentifying restrictions. At time $t$ we might use, say, $z_{it} = (x_{it}, x_{i,t-1})$, and then the instrument matrix $Z_i$ (with $T - 1$ rows) would be $\mathrm{diag}(z_{i2}, \ldots, z_{iT})$. An initial consistent estimator can be obtained by choosing weighting matrix $(N^{-1}\sum_{i=1}^{N} Z_i' Z_i)^{-1}$, just as with the system 2SLS estimator in Chapter 8. Then the optimal weighting matrix can be estimated. Ahn, Lee, and Schmidt (2002) provide further discussion.
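To see how the quasi-differenced equation (14.88) turns into sample moment conditions, consider the following sketch. It assumes a balanced panel stored as `y` ($N \times T$) and `x` ($N \times T \times K$), a parameter vector ordered as $(\beta, \theta_2, \ldots, \theta_T)$, and instruments $z_{it} = (x_{it}, x_{i,t-1})$; the function names are hypothetical and the objective would still need to be passed to a numerical optimizer.

```python
import numpy as np

def quasi_diff_resid(params, y, x):
    """Residuals e_it of (14.88) for t = 2, ..., T; returns an (N, T-1) array."""
    K = x.shape[2]
    beta, theta = params[:K], params[K:]                  # theta holds theta_2..theta_T
    y_qd = y[:, 1:] - theta * y[:, :-1]
    x_qd = x[:, 1:, :] - theta[None, :, None] * x[:, :-1, :]
    return y_qd - x_qd @ beta

def moments(params, y, x):
    """Sample moments N^{-1} sum_i Z_i' e_i with z_it = (x_it, x_{i,t-1})."""
    e = quasi_diff_resid(params, y, x)
    z = np.concatenate([x[:, 1:, :], x[:, :-1, :]], axis=2)   # (N, T-1, 2K)
    return (z * e[:, :, None]).mean(axis=0).ravel()

def gmm_objective(params, y, x, weight):
    """Quadratic form g' W g that a nonlinear GMM routine would minimize."""
    g = moments(params, y, x)
    return float(g @ weight @ g)
```

Because the block-diagonal instrument matrix keeps only the products of $z_{it}$ with $e_{it}$ for the same $t$, the moment vector has $(T - 1) \cdot 2K$ elements. In a first step `weight` would be the inverse of the sample average of $Z_i'Z_i$, and in a second step the inverse of an unrestricted estimate of the moment variance.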

If $x_{it}$ contains sequentially but not strictly exogenous explanatory variables (such as a lagged dependent variable), the instruments at time $t$ can only be chosen from $(x_{i,t-1}, \ldots, x_{i1})$, just as in Section 11.6.1. Holtz-Eakin, Newey, and Rosen (1988) explicitly consider models with lagged dependent variables.

Other transformations can be used. For example, at time $t \geq 2$ we can use the equation

$$\eta_{t-1} y_{it} - \eta_t y_{i,t-1} = (\eta_{t-1} x_{it} - \eta_t x_{i,t-1})\beta + e_{it}, \qquad t = 2, \ldots, T,$$

where now $e_{it} = \eta_{t-1} u_{it} - \eta_t u_{i,t-1}$. This equation has the advantage of allowing $\eta_t = 0$ for some $t$. The same choices of instruments are available depending on whether $\{x_{it}\}$ are strictly or sequentially exogenous.


Problems 


14.1. Consider the system in equations (14.34) and (14.35). 


a. How would you estimate equation (14.35) using single-equation methods? Give a few possibilities, ranging from simple to more complicated. State any additional assumptions relevant for estimating asymptotic variances or for efficiency of the various estimators.


b. Is equation (14.34) identified if $\gamma_1 = 0$?

c. Now suppose that $\gamma_3 = 0$, so that the parameters in equation (14.35) can be consistently estimated by OLS. Let $\hat{y}_2$ be the OLS fitted values. Explain why NLS estimation of

$$y_1 = x_1 \delta_1 + \gamma_1 \hat{y}_2^{\gamma_2} + error$$

does not consistently estimate $\delta_1$, $\gamma_1$, and $\gamma_2$ when $\gamma_1 \neq 0$ and $\gamma_2 \neq 1$.

14.2. Consider the following labor supply function nonlinear in parameters:

$$hours = z_1 \delta_1 + \gamma_1 (wage^{\rho_1} - 1)/\rho_1 + u_1, \qquad E(u_1 \mid z) = 0,$$

where $z_1$ contains unity and $z$ is the full set of exogenous variables.

a. Show that this model contains the level-level and level-log models as special cases. (Hint: For $w > 0$, $(w^{\rho} - 1)/\rho \to \log(w)$ as $\rho \to 0$.)

b. How would you test $H_0: \gamma_1 = 0$? (Be careful here; $\rho_1$ cannot be consistently estimated under $H_0$.)

c. Assuming that $\gamma_1 \neq 0$, how would you estimate this equation if $\mathrm{Var}(u_1 \mid z) = \sigma_1^2$? What if $\mathrm{Var}(u_1 \mid z)$ is not constant?

d. Find the gradient of the residual function with respect to $\delta_1$, $\gamma_1$, and $\rho_1$. (Hint: Recall that the derivative of $w^{\rho}$ with respect to $\rho$ is $w^{\rho} \log(w)$.)

e. Explain how to obtain the score test of $H_0: \rho_1 = 1$.


14.3. Use Theorem 14.3 to show that the optimal instrumental variables based on 
the conditional moment restrictions (14.60) are given by equation (14.63). 


14.4. a. Show that, under Assumptions WNLS.1-WNLS.3 in Chapter 12, the weighted NLS estimator has asymptotic variance equal to that of the efficient IV estimator based on the orthogonality condition $E[(y_i - m(x_i, \beta_o)) \mid x_i] = 0$.

b. When does the NLS estimator of $\beta_o$ achieve the efficiency bound derived in part a?

c. Suppose that, in addition to $E(y \mid x) = m(x, \beta_o)$, you use the restriction $\mathrm{Var}(y \mid x) = \sigma_o^2$ for some $\sigma_o^2 > 0$. Write down the two conditional moment restrictions for estimating $\beta_o$ and $\sigma_o^2$. What are the efficient instrumental variables?


14.5. Write down $\theta$, $\pi$, and the matrix $H$ such that $\pi = H\theta$ in Chamberlain’s approach to unobserved effects panel data models when $T = 3$.

14.6. Let $\hat{\pi}$ and $\tilde{\pi}$ be two consistent estimators of $\pi_o$, with Avar $\sqrt{N}(\hat{\pi} - \pi_o) = \Xi_o$ and Avar $\sqrt{N}(\tilde{\pi} - \pi_o) = \Lambda_o$. Let $\hat{\theta}$ be the CMD estimator based on $\hat{\pi}$, and let $\tilde{\theta}$ be the CMD estimator based on $\tilde{\pi}$, where $\pi_o = h(\theta_o)$. Show that, if $\Lambda_o - \Xi_o$ is p.s.d., then so is Avar $\sqrt{N}(\tilde{\theta} - \theta_o)$ $-$ Avar $\sqrt{N}(\hat{\theta} - \theta_o)$. (Hint: Twice use the fact that, for two positive definite matrices $A$ and $B$, $A - B$ is p.s.d. if and only if $B^{-1} - A^{-1}$ is p.s.d.)


14.7. Show that when the mapping from $\theta_o$ to $\pi_o$ is linear, $\pi_o = H\theta_o$ for a known $S \times P$ matrix $H$ with rank($H$) $= P$, the CMD estimator $\hat{\theta}$ is

$$\hat{\theta} = (H' \hat{\Xi}^{-1} H)^{-1} H' \hat{\Xi}^{-1} \hat{\pi}. \tag{14.89}$$

Equation (14.89) looks like a generalized least squares (GLS) estimator of $\hat{\pi}$ on $H$ using variance matrix $\hat{\Xi}$, and this apparent similarity has prompted some to call the minimum chi-square estimator a “generalized least squares” (GLS) estimator. Unfortunately, the association between CMD and GLS is misleading because $\hat{\pi}$ and $H$ are not data vectors whose row dimension, $S$, grows with $N$. The asymptotic properties of the minimum chi-square estimator do not follow from those of GLS.


14.8. In Problem 13.9, suppose you model the unconditional distribution of $y_0$ as $f_0(y_0; \theta)$, which depends on at least some elements of $\theta$ appearing in $f_t(y_t \mid y_{t-1}; \theta)$. Discuss the pros and cons of using $f_0(y_0; \theta)$ in a maximum likelihood analysis along with $f_t(y_t \mid y_{t-1}; \theta)$, $t = 1, 2, \ldots, T$.


14.9. Verify that, for the linear unobserved effects model under Assumptions RE.1-RE.3, the conditions of Lemma 14.1 hold for the fixed effects ($\hat{\theta}_2$) and the random effects ($\hat{\theta}_1$) estimators, with $\rho = \sigma_u^2$. (Hint: For clarity, it helps to introduce a cross section subscript, $i$. Then $A_1 = E(\check{X}_i' \check{X}_i)$, where $\check{X}_i = X_i - \lambda j_T \bar{x}_i$; $A_2 = E(\ddot{X}_i' \ddot{X}_i)$, where $\ddot{X}_i = X_i - j_T \bar{x}_i$; $s_{i1} = \check{X}_i' r_i$, where $r_i = v_i - \lambda j_T \bar{v}_i$; and $s_{i2} = \ddot{X}_i' u_i$; see Chapter 10 for further notation. You should show that $\ddot{X}_i' u_i = \ddot{X}_i' r_i$ and then $\ddot{X}_i' \check{X}_i = \ddot{X}_i' \ddot{X}_i$.)


14.10. Consider the model in (14.81) under the strict exogeneity condition (14.82). In addition, assume that $E(c_i \mid x_i) = 0$ (and so $x_{it}$ should contain a full set of time dummies, but we do not show them explicitly).

a. If $v_{it} = \eta_t c_i + u_{it}$, show that $E(v_{it} \mid x_i) = 0$, $t = 1, \ldots, T$.

b. Assume that $\mathrm{Var}(u_i \mid x_i, c_i) = \sigma_u^2 I_T$ and $\mathrm{Var}(c_i \mid x_i) = \sigma_c^2$. Find $\mathrm{Var}(v_{it})$ and $\mathrm{Cov}(v_{it}, v_{is})$, $t \neq s$.

c. Under parts a and b, propose an estimator that is asymptotically more efficient than the usual RE estimator.

14.11. Consider a multivariate regression model nonlinear in the parameters,

$$y_i = X_i g(\theta_o) + u_i, \qquad E(u_i \mid X_i) = 0,$$

where $y_i$ is $G \times 1$, $X_i$ is $G \times K$, and $g: \mathbb{R}^P \to \mathbb{R}^K$ is continuously differentiable. Explain how to estimate $\theta_o$ using CMD.


Appendix 14A 

Proof of Lemma 14.1: Given condition (14.49), $A_1 = (1/\rho)E(s_1 s_1')$, a $P \times P$ symmetric matrix, and

$$V_1 = A_1^{-1} E(s_1 s_1') A_1^{-1} = \rho^2 [E(s_1 s_1')]^{-1},$$

where we drop the argument $w$ for notational simplicity. Next, under condition (14.56), $A_2 = (1/\rho)E(s_2 s_1')$, and so

$$V_2 = A_2^{-1} E(s_2 s_2') (A_2')^{-1} = \rho^2 [E(s_2 s_1')]^{-1} E(s_2 s_2') [E(s_1 s_2')]^{-1}.$$

Now we use the standard result that $V_2 - V_1$ is positive semidefinite if and only if $V_1^{-1} - V_2^{-1}$ is p.s.d. But, dropping the term $\rho^{-2}$ (which is simply a positive constant), we have

$$V_1^{-1} - V_2^{-1} = E(s_1 s_1') - E(s_1 s_2') [E(s_2 s_2')]^{-1} E(s_2 s_1') \equiv E(r_1 r_1'),$$

where $r_1$ is the $P \times 1$ population residual from the population regression $s_1$ on $s_2$. As $E(r_1 r_1')$ is necessarily p.s.d., this step completes the proof.


IV NONLINEAR MODELS AND RELATED TOPICS


We now apply the general methods of Part III to study specific nonlinear models that 
often arise in applications. Many nonlinear econometric models are intended to ex- 
plain limited dependent variables. Roughly, a limited dependent variable is a variable 
whose range is restricted in some important way. Most variables encountered in 
economics are limited in range, but not all require special treatment. For example, 
many variables—wage, population, and food consumption, to name just a few—can 
only take on positive values. If a strictly positive variable takes on numerous values, 
we can avoid special econometric tools by taking the log of the variable and then 
using a linear model. 

When the variable to be explained, y, is discrete and takes on a finite number of 
values, it makes little sense to treat it as an approximately continuous variable. Dis- 
creteness of y does not in itself mean that a linear model for E( y | x) is inappropriate. 
However, in Chapter 15 we will see that linear models have certain drawbacks for 
modeling binary responses, and we will treat nonlinear models such as probit and 
logit. We cover basic multinomial response models in Chapter 16, including the case 
when the response has a natural ordering. 

Other kinds of limited dependent variables arise in econometric analysis, especially 
when modeling choices by individuals, families, or firms. Optimizing behavior often 
leads to corner solutions for some nontrivial fraction of the population. For example, 
during any given time, a fairly large fraction of the working age population does not 
work outside the home. Annual hours worked has a population distribution spread 
out over a range of values, but with a pileup at the value zero. While it could be that 
a linear model is appropriate for modeling expected hours worked, a linear model 
will likely lead to negative predicted hours worked for some people. Taking the nat- 
ural log is not possible because of the corner solution at zero. In Chapter 17 we dis- 
cuss econometric models that are better suited for describing these kinds of limited 
dependent variables. 

In Chapter 18 we cover count, fractional, and other nonnegative response vari- 
ables. Our emphasis is on estimating the conditional mean, and therefore we focus on 
estimation methods that do not require specific distributional assumptions (although 
they may nominally specify a distribution in a quasi-maximum likelihood analysis). 

In Chapter 19 we shift gears and study several problems concerning missing data, 
including data censoring, sample selection, and attrition. This is the first chapter 
where we confront the possibility that the sample we have to work with is not nec- 
essarily a random sample on all variables appearing in the underlying population 
model. Chapter 20 treats additional sampling issues, including stratified sampling and cluster sampling.

Chapter 21 considers estimation of treatment effects, where we explicitly introduce a counterfactual setting that is fundamental to the contemporary literature. This chapter shows how switching regression models and random coefficient models (with endogenous explanatory variables) fit into the treatment effect framework. Chapter 22 concludes with an introduction to modern duration analysis.


15 Binary Response Models


15.1 Introduction 


In binary response models, the variable to be explained, y, is a random variable tak- 
ing on the values zero and one, which indicate whether or not a certain event has 
occurred. For example, y = 1 if a person is employed, y = 0 otherwise; y = 1 if a 
family contributes to charity during a particular year, y = 0 otherwise; y = 1 if a firm 
has a particular type of pension plan, y = 0 otherwise. Regardless of the definition of 
y, it is traditional to refer to y = 1 as a success and y = 0 as a failure. 

As in the case of linear models, we often call $y$ the explained variable, the response 
variable, the dependent variable, or the endogenous variable; $x \equiv (x_1, x_2, \ldots, x_K)$ is 
the vector of explanatory variables, regressors, independent variables, exogenous 
variables, or covariates. 

In binary response models, interest lies primarily in the response probability, 


$$p(x) \equiv P(y = 1 \mid x) = P(y = 1 \mid x_1, x_2, \ldots, x_K), \tag{15.1}$$


for various values of x. For example, when y is an employment indicator, x might 
contain various individual characteristics such as education, age, marital status, and 
other factors that affect employment status, such as a binary indicator variable for 
participation in a recent job training program, or measures of past criminal behavior. 

For a continuous variable, $x_j$, the partial effect of $x_j$ on the response probability is 

$$\frac{\partial P(y = 1 \mid x)}{\partial x_j} = \frac{\partial p(x)}{\partial x_j}. \tag{15.2}$$

When multiplied by $\Delta x_j$, equation (15.2) gives the approximate change in $P(y = 1 \mid x)$ 
when $x_j$ increases by $\Delta x_j$, holding all other variables fixed (for "small" $\Delta x_j$). Of 
course if, say, $x_1 \equiv z$ and $x_2 \equiv z^2$ for some variable $z$ (for example, $z$ could be work 
experience), we would be interested in $\partial p(x)/\partial z$. 

If $x_K$ is a binary variable, interest lies in 

$$p(x_1, x_2, \ldots, x_{K-1}, 1) - p(x_1, x_2, \ldots, x_{K-1}, 0), \tag{15.3}$$

which is the difference in response probabilities when $x_K = 1$ and $x_K = 0$. For most 
of the models we consider, whether a variable $x_j$ is continuous or discrete, the partial 
effect of $x_j$ on $p(x)$ depends on all of $x$. 

In studying binary response models, we need to recall some basic facts about 
Bernoulli (zero-one) random variables. The only difference between the setup here 
and that in basic statistics is the conditioning on $x$. If $P(y = 1 \mid x) = p(x)$, then 
$P(y = 0 \mid x) = 1 - p(x)$, $E(y \mid x) = p(x)$, and $\mathrm{Var}(y \mid x) = p(x)[1 - p(x)]$. 




15.2 Linear Probability Model for Binary Response 


The linear probability model (LPM) for binary response $y$ is specified as 

$$P(y = 1 \mid x) = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_K x_K. \tag{15.4}$$


As usual, the $x_j$ can be functions of underlying explanatory variables, which would 
simply change the interpretations of the $\beta_j$. Assuming that $x_1$ is not functionally 
related to the other explanatory variables, $\beta_1 = \partial P(y = 1 \mid x)/\partial x_1$. Therefore, $\beta_1$ is the 
change in the probability of success given a one-unit increase in $x_1$. If $x_1$ is a binary 
explanatory variable, $\beta_1$ is just the difference in the probability of success when 
$x_1 = 1$ and $x_1 = 0$, holding the other $x_j$ fixed. 

Using functions such as quadratics, logarithms, and so on among the independent 
variables causes no new difficulties. The important point is that the $\beta_j$ now measure 
the effects of the explanatory variables $x_j$ on a particular probability. 

In deciding on an appropriate estimation technique, it is useful to derive the con- 
ditional mean and variance of y. Since y is a Bernoulli random variable, these are 
simply 


$$E(y \mid x) = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_K x_K \tag{15.5}$$

$$\mathrm{Var}(y \mid x) = x\beta(1 - x\beta), \tag{15.6}$$

where $x\beta$ is shorthand for the right-hand side of equation (15.5). 

Equation (15.5) implies that, given a random sample, the ordinary least squares 
(OLS) regression of $y$ on $1, x_1, x_2, \ldots, x_K$ produces consistent and even unbiased 
estimators of the $\beta_j$. Equation (15.6) means that heteroskedasticity is present unless 
all of the slope coefficients $\beta_1, \ldots, \beta_K$ are zero. A nice way to deal with this issue is 
to use standard heteroskedasticity-robust standard errors and $t$ statistics. Further, 
robust tests of multiple restrictions should also be used. There is one case where the 
usual $F$ statistic can be used, and that is to test for joint significance of all variables 
(leaving the constant unrestricted). This test is asymptotically valid because 
$\mathrm{Var}(y \mid x)$ is constant under this particular null hypothesis. 
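
To make the estimation advice concrete, the following is a minimal sketch of OLS estimation of an LPM with both the usual and the heteroskedasticity-robust standard errors. The simulated data, the coefficient values, and all variable names are assumptions made for illustration; the robust variance is the basic White (HC0) sandwich form, without a degrees-of-freedom adjustment.

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulated data (hypothetical): one continuous and one binary regressor.
n = 2000
x1 = rng.uniform(0.0, 1.0, n)
x2 = rng.integers(0, 2, n).astype(float)
p = 0.2 + 0.3 * x1 + 0.25 * x2            # true response probability, inside (0, 1)
y = (rng.uniform(size=n) < p).astype(float)

X = np.column_stack([np.ones(n), x1, x2])

# OLS coefficients from the normal equations.
XtX_inv = np.linalg.inv(X.T @ X)
beta_hat = XtX_inv @ X.T @ y
resid = y - X @ beta_hat

# Usual standard errors (valid only under homoskedasticity).
sigma2 = resid @ resid / (n - X.shape[1])
se_usual = np.sqrt(np.diag(sigma2 * XtX_inv))

# Robust sandwich: (X'X)^-1 [sum_i u_i^2 x_i'x_i] (X'X)^-1.
meat = (X * resid[:, None] ** 2).T @ X
se_robust = np.sqrt(np.diag(XtX_inv @ meat @ XtX_inv))

print("beta_hat :", beta_hat.round(3))
print("usual se :", se_usual.round(4))
print("robust se:", se_robust.round(4))
```

In a design like this one the two sets of standard errors are typically close, which is in line with the remark that robust inference is cheap insurance rather than a large correction.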

If we operate under the assumption that $P(y = 1 \mid x)$ is given by equation (15.4), 
then we can obtain an asymptotically more efficient estimator by applying weighted 
least squares (WLS). Let $\hat\beta$ be the OLS estimator, and let $\hat y_i$ denote the OLS fitted 
values. Then, provided $0 < \hat y_i < 1$ for all observations $i$, define the estimated standard 
deviation as $\hat\sigma_i \equiv [\hat y_i(1 - \hat y_i)]^{1/2}$. Then the WLS estimator, $\beta^*$, is obtained from the 
OLS regression 




$$y_i/\hat\sigma_i \ \text{ on } \ 1/\hat\sigma_i,\ x_{i1}/\hat\sigma_i,\ \ldots,\ x_{iK}/\hat\sigma_i, \qquad i = 1, 2, \ldots, N. \tag{15.7}$$


The usual standard errors from this regression are valid, as follows from the treat- 
ment of weighted least squares in Chapter 12. In addition, all other testing can be 
done using F statistics or LM statistics using weighted regressions. 
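
The two-step WLS procedure can be sketched as follows. This is an illustration under assumed simulated data, and the function name `lpm_wls` is hypothetical; the function refuses to proceed when any OLS fitted value falls outside the open unit interval, since the weights are then undefined.

```python
import numpy as np

def lpm_wls(X, y):
    """Two-step WLS for the LPM; X must already contain a column of ones."""
    beta_ols, *_ = np.linalg.lstsq(X, y, rcond=None)
    fitted = X @ beta_ols
    if np.any(fitted <= 0.0) or np.any(fitted >= 1.0):
        raise ValueError("OLS fitted values outside (0, 1); WLS weights undefined")
    sigma_hat = np.sqrt(fitted * (1.0 - fitted))     # estimated sd of y given x
    # Divide every variable, including the constant, by sigma_hat and rerun OLS.
    beta_wls, *_ = np.linalg.lstsq(X / sigma_hat[:, None], y / sigma_hat, rcond=None)
    return beta_ols, beta_wls

# Hypothetical data with response probabilities well inside the unit interval.
rng = np.random.default_rng(1)
n = 5000
x1 = rng.uniform(0.0, 1.0, n)
y = (rng.uniform(size=n) < 0.3 + 0.4 * x1).astype(float)
X = np.column_stack([np.ones(n), x1])

beta_ols, beta_wls = lpm_wls(X, y)
print("OLS:", beta_ols.round(3), " WLS:", beta_wls.round(3))
```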

If some of the OLS fitted values are not between zero and one, WLS analysis is not 
possible without ad hoc adjustments to bring deviant fitted values into the unit 
interval. Further, since the OLS fitted value $\hat y_i$ is an estimate of the conditional 
probability $P(y_i = 1 \mid x_i)$, it is somewhat awkward if the predicted probability is negative or 
above unity. 

Aside from the issue of fitted values being outside the unit interval, the LPM 
implies that a ceteris paribus unit increase in $x_j$ always changes $P(y = 1 \mid x)$ by the 
same amount, regardless of the initial value of $x_j$. This feature of the LPM cannot 
literally be true because continually increasing one of the $x_j$ would eventually drive 
$P(y = 1 \mid x)$ to be less than zero or greater than one. 

A sensible posture is to simply view the LPM as the linear projection of y on the 
explanatory variables. Recall from Chapter 2 that the linear projection provides the 
best least squares fit among linear functions of x (although x itself might include 
nonlinear functions of underlying explanatory variables). If we operate within this 
scenario, OLS estimation of the LPM is attractive because it consistently estimates 
the parameters in the linear projection. The WLS estimator can be viewed as a linear 
projection on weighted variables, which may not be of as much interest, particularly 
because the weights are not obtained from Var(y|x) if (15.4) fails. 

If the main purpose of estimating a binary response model is to approximate the 
partial effects of the explanatory variables, averaged across the distribution of x, then 
the LPM often does a very good job. (Some evidence on how well it does can be 
obtained by comparing the OLS coefficients with the average partial effects from the 
nonlinear models we turn to in Section 15.3.) The fact that some predicted proba- 
bilities are outside the unit interval need not be a serious concern. But there is no 
guarantee that the LPM provides good estimates of the partial effects for a wide 
range of covariate values, especially for extreme values of x. 


Example 15.1 (Married Women's Labor Force Participation): We use the data from 
MROZ.RAW to estimate a linear probability model for labor force participation 
(inlf) of married women. Of the 753 women in the sample, 428 report working 
nonzero hours during the year. The variables we use to explain labor force participation 
are age, education, experience, nonwife income in thousands (nwifeinc), number of 
children less than six years of age (kidslt6), and number of children between 6 and 18 
inclusive (kidsge6); 606 women report having no young children, while 118 report 
having exactly one young child. The usual OLS standard errors are in parentheses, 
while the heteroskedasticity-robust standard errors are in brackets: 


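$$\begin{aligned}
\widehat{\mathit{inlf}} ={}& \underset{\substack{(.154)\\ [.151]}}{.586} - \underset{\substack{(.0014)\\ [.0015]}}{.0034}\,\mathit{nwifeinc} + \underset{\substack{(.007)\\ [.007]}}{.038}\,\mathit{educ} + \underset{\substack{(.006)\\ [.006]}}{.039}\,\mathit{exper} - \underset{\substack{(.00018)\\ [.00019]}}{.00060}\,\mathit{exper}^2 \\
& - \underset{\substack{(.002)\\ [.002]}}{.016}\,\mathit{age} - \underset{\substack{(.034)\\ [.032]}}{.262}\,\mathit{kidslt6} + \underset{\substack{(.013)\\ [.013]}}{.013}\,\mathit{kidsge6} \\
& N = 753, \quad R^2 = .264
\end{aligned}$$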


With the exception of kidsge6, all coefficients have sensible signs and are statistically 
significant; kidsge6 is neither statistically significant nor practically important. The 
coefficient on nwifeinc means that if nonwife income increases by 10 ($10,000), the 
probability of being in the labor force is predicted to fall by .034. This is a small effect 
given that an increase in income by $10,000 in 1975 dollars is very large in this sam- 
ple. (The average of nwifeinc is about $20,129 with standard deviation $11,635.) 
Having one more small child is estimated to reduce the probability of inlf = 1 by 
about .262, which is a fairly large effect. 

Of the 753 fitted probabilities, 33 are outside the unit interval. Rather than using 
some adjustment to those 33 fitted values and applying WLS, we just use OLS and 
report heteroskedasticity-robust standard errors. Interestingly, these differ in practi- 
cally unimportant ways from the usual OLS standard errors. 


The case for the LPM is even stronger if most of the $x_j$ are discrete and take on 
only a few values. In the previous example, to allow a diminishing effect of young 
children on the probability of labor force participation, we can break kidslt6 into 
three binary indicators: no young children, one young child, and two or more young 
children. The last two indicators can be used in place of kidslt6 to allow the first 
young child to have a larger effect than subsequent young children. (Interestingly, 
when this method is used, the marginal effects of the first and second young children 
are virtually the same. The estimated effect of the first child is about −.263, and the 
additional reduction in the probability of labor force participation for the next child 
is about −.274.) 

In the extreme case where the model is saturated (that is, $x$ contains dummy 
variables for mutually exclusive and exhaustive categories), the LPM is completely 
general. The fitted probabilities are simply the average $y_i$ within each cell defined by the 
different values of $x$; we need not worry about fitted probabilities less than zero or 
greater than one. See Problem 15.1. 
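
The claim about saturated models is easy to check numerically. The sketch below uses hypothetical simulated data with three categories and confirms that the OLS fitted values from a regression on a constant and category dummies coincide with the within-cell sample averages of $y_i$.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 600
cell = rng.integers(0, 3, n)                 # three exclusive, exhaustive categories
p_true = np.array([0.1, 0.5, 0.9])[cell]     # hypothetical cell probabilities
y = (rng.uniform(size=n) < p_true).astype(float)

# Saturated LPM: constant plus dummies for categories 1 and 2 (category 0 is the base).
X = np.column_stack([np.ones(n), cell == 1, cell == 2]).astype(float)
beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
fitted = X @ beta_hat

# Within-cell averages of y.
cell_means = np.array([y[cell == c].mean() for c in range(3)])

print("cell means   :", cell_means.round(4))
print("fitted values:", np.unique(fitted.round(4)))
```

Because each fitted value is a sample proportion, it necessarily lies in the unit interval.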


15.3 Index Models for Binary Response: Probit and Logit 


We now study binary response models of the form 

$$P(y = 1 \mid x) = G(x\beta) \equiv p(x), \tag{15.8}$$

where $x$ is $1 \times K$, $\beta$ is $K \times 1$, and we take the first element of $x$ to be unity. Examples 
where $x$ does not contain unity are rare in practice. For the linear probability model, 
$G(z) = z$ is the identity function, which means that the response probabilities cannot 
be between 0 and 1 for all $x$ and $\beta$. In this section we assume that $G(\cdot)$ takes on values 
in the open unit interval: $0 < G(z) < 1$ for all $z \in \mathbb{R}$. 

The model in equation (15.8) is generally called an index model because it restricts 
the way in which the response probability depends on $x$: $p(x)$ is a function of $x$ only 
through the index $x\beta = \beta_1 + \beta_2 x_2 + \cdots + \beta_K x_K$. The function $G$ maps the index into 
the response probability. 

In most applications, G is a cumulative distribution function (cdf) whose specific 
form can sometimes be derived from an underlying economic model. For example, in 
Problem 15.2 you are asked to derive an index model from a utility-based model of 
charitable giving. The binary indicator y equals unity if a family contributes to charity 
and zero otherwise. The vector x contains family characteristics, income, and the price 
of a charitable contribution (as determined by marginal tax rates). Under a normality 
assumption on a particular unobservable taste variable, G is the standard normal cdf. 

Index models where $G$ is a cdf can be derived more generally from an underlying 
latent variable model, as in Example 13.1: 

$$y^* = x\beta + e, \qquad y = 1[y^* > 0], \tag{15.9}$$

where $e$ is a continuously distributed variable independent of $x$ and the distribution 
of $e$ is symmetric about zero; recall from Chapter 13 that $1[\cdot]$ is the indicator function. 
If $G$ is the cdf of $e$, then, because the pdf of $e$ is symmetric about zero, $1 - G(-z) = G(z)$ 
for all real numbers $z$. Therefore, 

$$P(y = 1 \mid x) = P(y^* > 0 \mid x) = P(e > -x\beta \mid x) = 1 - G(-x\beta) = G(x\beta),$$

which is exactly equation (15.8). 

There is no particular reason for requiring e to be symmetrically distributed in the 
latent variable model, but this happens to be the case for the binary response models 
applied most often. 

In most applications of binary response models, the primary goal is to explain the 
effects of the $x_j$ on the response probability $P(y = 1 \mid x)$. The latent variable 
formulation tends to give the impression that we are primarily interested in the effects of 
each $x_j$ on $y^*$. As we will see, the direction of the effects of $x_j$ on $E(y^* \mid x) = x\beta$ and 
on $E(y \mid x) = P(y = 1 \mid x) = G(x\beta)$ are the same. But the latent variable $y^*$ rarely has 
a well-defined unit of measurement (for example, $y^*$ might be measured in utility 
units). Therefore, the magnitude of $\beta_j$ is not especially meaningful except in special 
cases. 

The probit model is the special case of equation (15.8) with 

$$G(z) = \Phi(z) \equiv \int_{-\infty}^{z} \phi(v)\, dv, \tag{15.10}$$

where $\phi(z)$ is the standard normal density 

$$\phi(z) = (2\pi)^{-1/2} \exp(-z^2/2). \tag{15.11}$$

The probit model can be derived from the latent variable formulation when $e$ has a 
standard normal distribution. 

The logit model is a special case of equation (15.8) with 

$$G(z) = \Lambda(z) \equiv \exp(z)/[1 + \exp(z)]. \tag{15.12}$$

This model arises from the model (15.9) when $e$ has a standard logistic distribution. 

The general specification (15.8) allows us to cover probit, logit, and a number of 
other binary choice models in one framework. In fact, in what follows we do not even 
need G to be a cdf, but we do assume that G(z) is strictly between zero and unity for 
all real numbers z. 

In order to successfully apply probit and logit models, it is important to know how 
to interpret the $\beta_j$ on both continuous and discrete explanatory variables. First, if $x_j$ 
is continuous, 

$$\frac{\partial p(x)}{\partial x_j} = g(x\beta)\beta_j, \qquad \text{where } g(z) \equiv \frac{dG}{dz}(z). \tag{15.13}$$

Therefore, the partial effect of $x_j$ on $p(x)$ depends on $x$ through $g(x\beta)$. If $G(\cdot)$ is a 
strictly increasing cdf, as in the probit and logit cases, $g(z) > 0$ for all $z$. Therefore, 
the sign of the effect is given by the sign of $\beta_j$. Also, the relative effects do not depend 
on $x$: for continuous variables $x_j$ and $x_h$, the ratio of the partial effects is constant and 
given by the ratio of the corresponding coefficients: 

$$\frac{\partial p(x)/\partial x_j}{\partial p(x)/\partial x_h} = \beta_j/\beta_h.$$

In the typical case that $g$ is a symmetric density about zero, with unique mode at zero, the largest 
effect is when $x\beta = 0$. For example, in the probit case with $g(z) = \phi(z)$, $g(0) = \phi(0) 
= 1/\sqrt{2\pi} \approx .399$. In the logit case, $g(z) = \exp(z)/[1 + \exp(z)]^2$, and so $g(0) = .25$. 
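
A short numerical check of these statements may help. The sketch below evaluates the scale factor $g(\cdot)$ for probit and logit and the partial effect $g(x\beta)\beta_j$ at a few index values. The coefficient values and index values are hypothetical, chosen only for illustration.

```python
import math

def probit_g(z):
    """Standard normal density phi(z)."""
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)

def logit_g(z):
    """Logistic density exp(z) / [1 + exp(z)]^2."""
    return math.exp(z) / (1.0 + math.exp(z)) ** 2

def partial_effect(g, index, beta_j):
    """Partial effect g(x beta) * beta_j of a continuous regressor."""
    return g(index) * beta_j

beta_j, beta_h = 0.8, -0.4            # hypothetical coefficients
for index in (-2.0, 0.0, 1.5):        # hypothetical values of the index x beta
    pe_j = partial_effect(probit_g, index, beta_j)
    pe_h = partial_effect(probit_g, index, beta_h)
    # The ratio pe_j / pe_h equals beta_j / beta_h at every index value.
    print(index, round(pe_j, 4), round(pe_h, 4), round(pe_j / pe_h, 4))

print("probit g(0):", round(probit_g(0.0), 3), " logit g(0):", logit_g(0.0))
```

The individual effects shrink as the index moves away from zero, while their ratio stays fixed at $\beta_j/\beta_h$.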

If $x_K$ is a binary explanatory variable, then the partial effect from changing $x_K$ 
from zero to one, holding all other variables fixed, is simply 

$$G(\beta_1 + \beta_2 x_2 + \cdots + \beta_{K-1} x_{K-1} + \beta_K) - G(\beta_1 + \beta_2 x_2 + \cdots + \beta_{K-1} x_{K-1}). \tag{15.14}$$

Again, this expression depends on all other values of the other $x_j$. For example, if $y$ is 
an employment indicator and $x_K$ is a dummy variable indicating participation in a job 
training program, then expression (15.14) is the change in the probability of 
employment due to the job training program; this depends on other characteristics that 
affect employability, such as education and experience. Knowing the sign of $\beta_K$ is 
enough to determine whether the program had a positive or negative effect. But to 
find the magnitude of the effect, we have to estimate expression (15.14). 

We can also use the difference in expression (15.14) for other kinds of discrete 
variables (such as number of children). If $x_K$ denotes this variable, then the effect on 
the probability of $x_K$ going from $c_K$ to $c_K + 1$ is simply 

$$G[\beta_1 + \beta_2 x_2 + \cdots + \beta_{K-1} x_{K-1} + \beta_K(c_K + 1)] - G(\beta_1 + \beta_2 x_2 + \cdots + \beta_{K-1} x_{K-1} + \beta_K c_K). \tag{15.15}$$

It is straightforward to include standard functional forms among the explanatory 
variables. For example, in the model 

$$P(y = 1 \mid z) = G[\beta_0 + \beta_1 z_1 + \beta_2 z_1^2 + \beta_3 \log(z_2) + \beta_4 z_3],$$

the partial effect of $z_1$ on $P(y = 1 \mid z)$ is $\partial P(y = 1 \mid z)/\partial z_1 = g(x\beta)(\beta_1 + 2\beta_2 z_1)$, where 
$x\beta = \beta_0 + \beta_1 z_1 + \beta_2 z_1^2 + \beta_3 \log(z_2) + \beta_4 z_3$. It follows that if the quadratic in $z_1$ has a 
hump shape or a U shape, the turning point in the response probability is $|\beta_1/(2\beta_2)|$ 
(because $g(x\beta) > 0$). Also, $\partial P(y = 1 \mid z)/\partial \log(z_2) = g(x\beta)\beta_3$, and so $g(x\beta)(\beta_3/100)$ 
is the approximate change in $P(y = 1 \mid z)$ given a 1 percent increase in $z_2$. Models 
with interactions among explanatory variables, including interactions between 
discrete and continuous variables, are handled similarly. When measuring the effects of 
discrete variables, we should use expression (15.15). 


15.4 Maximum Likelihood Estimation of Binary Response Index Models 


Assume we have $N$ independent, identically distributed observations following the 
model (15.8). Since we essentially covered the case of probit in Chapter 13, the 
discussion here will be brief. To estimate the model by (conditional) maximum 
likelihood (MLE), we need the log-likelihood function for each $i$. The density of $y_i$ given 
$x_i$ can be written as 

$$f(y \mid x_i; \beta) = [G(x_i\beta)]^{y}[1 - G(x_i\beta)]^{1-y}, \qquad y = 0, 1. \tag{15.16}$$

The log likelihood for observation $i$ is a function of the $K \times 1$ vector of parameters 
and the data $(x_i, y_i)$: 

$$\ell_i(\beta) = y_i \log[G(x_i\beta)] + (1 - y_i) \log[1 - G(x_i\beta)]. \tag{15.17}$$

(Recall from Chapter 13 that, technically speaking, we should distinguish the "true" 
value of beta, $\beta_o$, from a generic value. For conciseness we do not do so here.) 
Restricting $G(\cdot)$ to be strictly between zero and one ensures that $\ell_i(\beta)$ is well defined 
for all values of $\beta$. 
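
The observation-level log likelihood translates directly into code. The following sketch sums it over a tiny hypothetical data set and evaluates it for both the logistic and standard normal choices of $G$; the data and parameter values are made up for illustration.

```python
import math

def logistic_cdf(z):
    return 1.0 / (1.0 + math.exp(-z))

def norm_cdf(z):
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def log_likelihood(beta, X, y, G):
    """Sum over i of y_i*log G(x_i beta) + (1 - y_i)*log[1 - G(x_i beta)]."""
    total = 0.0
    for x_i, y_i in zip(X, y):
        p_i = G(sum(b * v for b, v in zip(beta, x_i)))
        total += y_i * math.log(p_i) + (1 - y_i) * math.log(1.0 - p_i)
    return total

# Hypothetical data: a constant and one regressor.
X = [[1.0, 0.5], [1.0, -1.2], [1.0, 2.0], [1.0, 0.3]]
y = [1, 0, 1, 0]

for name, G in (("logit", logistic_cdf), ("probit", norm_cdf)):
    ll_zero = log_likelihood([0.0, 0.0], X, y, G)   # every probability equals .5
    ll_alt = log_likelihood([0.0, 1.0], X, y, G)
    print(f"{name}: logL(0, 0) = {ll_zero:.4f}, logL(0, 1) = {ll_alt:.4f}")
```

At $\beta = 0$ every response probability is one-half, so the log likelihood is $N \log(.5)$ for either choice of $G$.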

As usual, the log likelihood for a sample size of $N$ is $\mathcal{L}(\beta) = \sum_{i=1}^{N} \ell_i(\beta)$, and the 
MLE of $\beta$, denoted $\hat\beta$, maximizes this log likelihood. If $G(\cdot)$ is the standard normal 
cdf, then $\hat\beta$ is the probit estimator; if $G(\cdot)$ is the logistic cdf, then $\hat\beta$ is the logit 
estimator. From the general maximum likelihood results we know that $\hat\beta$ is consistent 
and asymptotically normal. We can also easily estimate the asymptotic variance of $\hat\beta$. 

We assume that $G(\cdot)$ is twice continuously differentiable, an assumption that is 
usually satisfied in applications (and, in particular, for probit and logit). As before, 
the function $g(z)$ is the derivative of $G(z)$. For the probit model, $g(z) = \phi(z)$, and for 
the logit model, $g(z) = \exp(z)/[1 + \exp(z)]^2$. 

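Using the same calculations for the probit example as in Chapter 13, the score of 
the conditional log likelihood for observation $i$ can be shown to be 

$$s_i(\beta) \equiv \frac{g(x_i\beta)\, x_i'\,[y_i - G(x_i\beta)]}{G(x_i\beta)[1 - G(x_i\beta)]}. \tag{15.18}$$

Similarly, the expected value of the Hessian conditional on $x_i$ is 

$$-E[H_i(\beta) \mid x_i] = \frac{[g(x_i\beta)]^2\, x_i'x_i}{G(x_i\beta)[1 - G(x_i\beta)]} \equiv A(x_i, \beta), \tag{15.19}$$

which is a $K \times K$ positive semidefinite (p.s.d.) matrix for each $i$. From the general 
conditional MLE results in Chapter 13, $\mathrm{Avar}(\hat\beta)$ is estimated as 

$$\widehat{\mathrm{Avar}}(\hat\beta) \equiv \left\{ \sum_{i=1}^{N} \frac{[g(x_i\hat\beta)]^2\, x_i'x_i}{G(x_i\hat\beta)[1 - G(x_i\hat\beta)]} \right\}^{-1} \equiv \hat V. \tag{15.20}$$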

In most cases the inverse exists, and when it does, $\hat V$ is positive definite. If the matrix 
in equation (15.20) is not invertible, then perfect collinearity probably exists among 
the regressors. 
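
One way to see how the score, the expected Hessian, and the variance estimator fit together is a Fisher scoring iteration. The sketch below does this for logit on simulated data. It is an illustration under assumed data and parameter values, not the algorithm of any particular package, and the function name `fit_logit` is hypothetical.

```python
import numpy as np

def fit_logit(X, y, tol=1e-10, max_iter=100):
    """Logit MLE by Fisher scoring; returns (beta_hat, V_hat, log_likelihood)."""
    beta = np.zeros(X.shape[1])
    for _ in range(max_iter):
        G = 1.0 / (1.0 + np.exp(-(X @ beta)))        # G(x_i beta)
        g = G * (1.0 - G)                            # logistic density g(x_i beta)
        score = X.T @ (g * (y - G) / (G * (1.0 - G)))            # sum of scores (15.18)
        info = (X * (g ** 2 / (G * (1.0 - G)))[:, None]).T @ X   # sum of A(x_i, beta) (15.19)
        step = np.linalg.solve(info, score)
        beta = beta + step
        if np.max(np.abs(step)) < tol:
            break
    G = 1.0 / (1.0 + np.exp(-(X @ beta)))
    g = G * (1.0 - G)
    V_hat = np.linalg.inv((X * (g ** 2 / (G * (1.0 - G)))[:, None]).T @ X)  # (15.20)
    loglik = float(np.sum(y * np.log(G) + (1.0 - y) * np.log(1.0 - G)))
    return beta, V_hat, loglik

# Hypothetical data generated from a logit model with beta = (-0.5, 1.0).
rng = np.random.default_rng(5)
n = 3000
X = np.column_stack([np.ones(n), rng.normal(size=n)])
y = (rng.uniform(size=n) < 1.0 / (1.0 + np.exp(-(X @ np.array([-0.5, 1.0]))))).astype(float)

beta_hat, V_hat, loglik = fit_logit(X, y)
se = np.sqrt(np.diag(V_hat))          # asymptotic standard errors
print("beta_hat:", beta_hat.round(3), " se:", se.round(3), " logL:", round(loglik, 2))
```

If the regressors were perfectly collinear, `np.linalg.solve` would fail or return unstable values, which is the numerical counterpart of the noninvertibility noted above.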

As usual, we treat $\hat\beta$ as being normally distributed with mean $\beta$ and variance 
matrix in equation (15.20). The (asymptotic) standard error of $\hat\beta_j$ is the square root of 
the $j$th diagonal element of $\hat V$. These can be used to construct $t$ statistics, which have 
a limiting standard normal distribution, and to construct approximate confidence 
intervals for each population parameter. These are reported with the estimates for 
packages that perform logit and probit. We discuss multiple hypothesis testing in the 
next section. 

Some packages also compute Huber-White standard errors as an option for probit 
and logit analysis, using the sandwich form (13.71). While the robust variance matrix 
is consistent, using it in place of equation (15.20) means we must think that the binary 
response model is incorrectly specified. Unlike with nonlinear regression, in a binary 
response model it is not possible to correctly specify E(y|x) but to misspecify 
Var(y|x). Once we have specified P(y = 1|x), we have specified all conditional 
moments of y given x. Nevertheless, as we discussed in Section 13.11, it may be 
prudent to act as if all models are merely approximations to the truth, in which case 
inference should be based on the sandwich estimator in equation (13.71). (The sand- 
wich estimator that uses the expected Hessian, as in equation (12.49), makes no sense 
in the binary response context because the expected Hessian cannot be computed 
without assumption (15.8).) 

In Section 15.8 we will see that, when using binary response models with panel 
data, it is sometimes important to compute variance matrix estimators that are robust 
to serial dependence. But this need arises as a result of dependence across time or 
subgroup, and not because the response probability is misspecified. 


15.5 Testing in Binary Response Index Models 


Any of the three tests from general MLE analysis—the Wald, likelihood ratio (LR), or 
Lagrange multiplier (LM) test—can be used to test hypotheses in binary response con- 
texts. Since the tests are all asymptotically equivalent under local alternatives, the choice 
of statistic usually depends on computational simplicity (since finite sample compar- 
isons must be limited in scope). In the following subsections we discuss some testing 
situations that often arise in binary choice analysis, and we recommend particular 
tests for their computational advantages. 

15.5.1 Testing Multiple Exclusion Restrictions 

Consider the model 

$$P(y = 1 \mid x, z) = G(x\beta + z\gamma), \tag{15.21}$$

where $x$ is $1 \times K$ and $z$ is $1 \times Q$. We wish to test the null hypothesis $H_0: \gamma = 0$, so we 
are testing $Q$ exclusion restrictions. The elements of $z$ can be functions of $x$, such as 
quadratics and interactions, in which case the test is a pure functional form test. Or, 
the $z$ can be additional explanatory variables. For example, $z$ could contain dummy 
variables for occupation or region. In any case, the form of the test is the same. 

Some packages, such as Stata, compute the Wald statistic for exclusion restrictions 
using a simple command following estimation of the general model. This capability 
makes it very easy to test multiple exclusion restrictions, provided the dimension of 
(x, z) is not so large as to make probit estimation difficult. 

The likelihood ratio statistic is also easy to use. Let $\mathcal{L}_{ur}$ denote the value of the 
log-likelihood function from probit of $y$ on $x$ and $z$ (the unrestricted model), and let $\mathcal{L}_r$ 
denote the value of the log-likelihood function from probit of $y$ on $x$ (the restricted 
model). Then the LR test of $H_0: \gamma = 0$ is simply $2(\mathcal{L}_{ur} - \mathcal{L}_r)$, which has an 
asymptotic $\chi^2_Q$ distribution under $H_0$. This is analogous to the usual $F$ statistic in OLS 
analysis of a linear model. 
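
The LR calculation can be sketched in a few lines. For computational simplicity the example uses logit rather than probit (the form of the statistic is the same), and the data, coefficient values, and function name `logit_loglik` are assumptions made for illustration. The constant 5.991 is the 5% critical value of the chi-square distribution with two degrees of freedom.

```python
import numpy as np

def logit_loglik(X, y, max_iter=100, tol=1e-10):
    """Maximized logit log likelihood, computed by Newton-Raphson."""
    beta = np.zeros(X.shape[1])
    for _ in range(max_iter):
        G = 1.0 / (1.0 + np.exp(-(X @ beta)))
        step = np.linalg.solve((X * (G * (1.0 - G))[:, None]).T @ X, X.T @ (y - G))
        beta += step
        if np.max(np.abs(step)) < tol:
            break
    G = 1.0 / (1.0 + np.exp(-(X @ beta)))
    return float(np.sum(y * np.log(G) + (1.0 - y) * np.log(1.0 - G)))

# Hypothetical data in which the first element of z belongs in the model.
rng = np.random.default_rng(3)
n = 2000
x = np.column_stack([np.ones(n), rng.normal(size=n)])   # includes the constant
z = rng.normal(size=(n, 2))                              # Q = 2 candidate regressors
index = x @ np.array([-0.3, 0.7]) + z @ np.array([0.8, 0.0])
y = (rng.uniform(size=n) < 1.0 / (1.0 + np.exp(-index))).astype(float)

ll_ur = logit_loglik(np.column_stack([x, z]), y)   # unrestricted: y on x and z
ll_r = logit_loglik(x, y)                          # restricted: y on x only
lr = 2.0 * (ll_ur - ll_r)

crit_5pct = 5.991
print(f"LR = {lr:.2f}, reject H0 at the 5% level: {lr > crit_5pct}")
```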

The score or LM test is attractive if the unrestricted model is difficult to estimate. 
In this section, let $\hat\beta$ denote the restricted estimator of $\beta$, that is, the probit or logit 
estimator with $z$ excluded from the model. The LM statistic using the estimated 
expected Hessian, $\hat A_i$ (see equation (15.20) and Section 12.6.2), can be shown to be 
numerically identical to the following: (1) Define $\hat u_i \equiv y_i - G(x_i\hat\beta)$, $\hat G_i \equiv G(x_i\hat\beta)$, and 
$\hat g_i \equiv g(x_i\hat\beta)$. These are all obtainable after estimating the model without $z$. (2) Use all 
$N$ observations to run the auxiliary OLS regression 

$$\frac{\hat u_i}{\sqrt{\hat G_i(1 - \hat G_i)}} \ \text{ on } \ \frac{\hat g_i}{\sqrt{\hat G_i(1 - \hat G_i)}}\, x_i,\ \frac{\hat g_i}{\sqrt{\hat G_i(1 - \hat G_i)}}\, z_i. \tag{15.22}$$

The LM statistic is equal to the explained sum of squares from this regression. A test 
that is asymptotically (but not numerically) equivalent is $NR_u^2$, where $R_u^2$ is the 
uncentered R-squared from regression (15.22). 

The LM procedure is rather easy to remember. The term $\hat g_i x_i$ is the gradient of the 
mean function $G(x_i\beta + z_i\gamma)$ with respect to $\beta$, evaluated at $\beta = \hat\beta$ and $\gamma = 0$. 
Similarly, $\hat g_i z_i$ is the gradient of $G(x_i\beta + z_i\gamma)$ with respect to $\gamma$, again evaluated at $\beta = \hat\beta$ 
and $\gamma = 0$. Finally, under $H_0: \gamma = 0$, the conditional variance of $u_i$ given $(x_i, z_i)$ is 
$G(x_i\beta)[1 - G(x_i\beta)]$; therefore, $[\hat G_i(1 - \hat G_i)]^{1/2}$ is an estimate of the conditional 
standard deviation of $u_i$. The dependent variable in regression (15.22) is often called a 
standardized residual because it is an estimate of $u_i/[G_i(1 - G_i)]^{1/2}$, which has unit 
conditional (and unconditional) variance. The regressors are simply the gradient of the 
conditional mean function with respect to both sets of parameters, evaluated under 
$H_0$, and weighted by the estimated inverse conditional standard deviation. The first 
set of regressors in regression (15.22) is $1 \times K$ and the second set is $1 \times Q$. 

Under $H_0$, $LM \overset{a}{\sim} \chi^2_Q$. The LM approach can be an attractive alternative to the LR 
statistic if $z$ has large dimension, since with many explanatory variables probit can be 
difficult to estimate. 
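
The two-step recipe for the LM statistic can be sketched as follows. As an assumption made for brevity, the restricted model is logit and the data are simulated with the null hypothesis true; `fit_logit` is an illustrative helper, not a library function.

```python
import numpy as np

def fit_logit(X, y, max_iter=100, tol=1e-10):
    """Logit coefficients by Newton-Raphson."""
    beta = np.zeros(X.shape[1])
    for _ in range(max_iter):
        G = 1.0 / (1.0 + np.exp(-(X @ beta)))
        step = np.linalg.solve((X * (G * (1.0 - G))[:, None]).T @ X, X.T @ (y - G))
        beta += step
        if np.max(np.abs(step)) < tol:
            break
    return beta

rng = np.random.default_rng(4)
n = 1500
x = np.column_stack([np.ones(n), rng.normal(size=n)])
z = rng.normal(size=(n, 2))                       # excluded under H0, so Q = 2
p = 1.0 / (1.0 + np.exp(-(x @ np.array([0.2, -0.6]))))
y = (rng.uniform(size=n) < p).astype(float)

# Step 1: restricted estimates and the pieces u_hat, G_hat, g_hat.
beta_r = fit_logit(x, y)
G_hat = 1.0 / (1.0 + np.exp(-(x @ beta_r)))
g_hat = G_hat * (1.0 - G_hat)                     # logistic density at the restricted index
u_hat = y - G_hat
w = 1.0 / np.sqrt(G_hat * (1.0 - G_hat))          # inverse conditional standard deviation

# Step 2: auxiliary OLS regression of the standardized residual on weighted gradients.
dep = u_hat * w
regs = np.column_stack([x, z]) * (g_hat * w)[:, None]
coef, *_ = np.linalg.lstsq(regs, dep, rcond=None)
fitted = regs @ coef

lm = float(fitted @ fitted)                       # explained sum of squares
n_r2 = n * lm / float(dep @ dep)                  # N times the uncentered R-squared
print(f"LM = {lm:.3f}, N*R_u^2 = {n_r2:.3f}")
```

The two statistics differ only through the sum of squared standardized residuals, which is close to $N$ in large samples.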


15.5.2 Testing Nonlinear Hypotheses about $\beta$ 


For testing nonlinear restrictions on $\beta$ in equation (15.8), the Wald statistic is 
computationally the easiest because the unrestricted estimator of $\beta$, which is just probit 
or logit, is easy to obtain. Actually imposing nonlinear restrictions in estimation 
(which is required to apply the score or likelihood ratio methods) can be difficult. 
However, we must also remember that the Wald statistic for testing nonlinear 
restrictions is not invariant to reparameterizations, whereas the LM and LR statistics are. 
(See Sections 12.6 and 13.6; for the LM statistic, we would probably use the expected 
Hessian.) 

Let the restrictions on $\beta$ be given by $H_0: c(\beta) = 0$, where $c(\beta)$ is a $Q \times 1$ vector of 
possibly nonlinear functions satisfying the differentiability and rank requirements 
from Chapter 13. Then, from the general MLE analysis, the Wald statistic is simply 

$$W = c(\hat\beta)'\,[\nabla_\beta c(\hat\beta)\, \hat V\, \nabla_\beta c(\hat\beta)']^{-1}\, c(\hat\beta), \tag{15.23}$$

where $\hat V$ is given in equation (15.20) and $\nabla_\beta c(\hat\beta)$ is the $Q \times K$ Jacobian of $c(\beta)$ 
evaluated at $\hat\beta$. 
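
A small numerical sketch shows both how the Wald statistic is assembled and why its lack of invariance matters. The estimates and variance matrix below are hypothetical numbers, not output from any data set. The restrictions $\beta_1\beta_2 = 1$ and $\beta_1 = 1/\beta_2$ are algebraically equivalent, yet they produce different Wald statistics.

```python
import numpy as np

def wald_statistic(c_val, jac, V):
    """W = c' [J V J']^{-1} c for a Q-vector c and its Q x K Jacobian J."""
    c_val = np.atleast_1d(c_val)
    jac = np.atleast_2d(jac)
    return float(c_val @ np.linalg.solve(jac @ V @ jac.T, c_val))

# Hypothetical estimates and estimated asymptotic variance matrix.
b1, b2 = 0.6, 2.0
V_hat = np.diag([0.01, 0.04])

# Parameterization A: c(beta) = beta_1 * beta_2 - 1, Jacobian [beta_2, beta_1].
w_a = wald_statistic(b1 * b2 - 1.0, [b2, b1], V_hat)

# Parameterization B: c(beta) = beta_1 - 1 / beta_2, Jacobian [1, 1 / beta_2^2].
w_b = wald_statistic(b1 - 1.0 / b2, [1.0, 1.0 / b2 ** 2], V_hat)

print(f"W (product form) = {w_a:.4f}, W (ratio form) = {w_b:.4f}")
```

With these numbers the two forms give roughly .735 and .800, so a borderline test decision could depend on how the restriction is written.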


15.5.3 Tests against More General Alternatives 


In addition to testing for omitted variables, sometimes we wish to test the probit or 
logit model against a more general functional form. When the alternatives are not 
standard binary response models, the Wald and LR statistics are cumbersome to ap- 
ply, whereas the LM approach is convenient because it only requires estimation of 
the null model. 

As an example of a more complicated binary choice model, consider the latent 
variable model (15.9) but assume that $e \mid x \sim \mathrm{Normal}[0, \exp(2x_1\delta)]$, where $x_1$ is a 
$1 \times K_1$ subset of $x$ that excludes a constant and $\delta$ is a $K_1 \times 1$ vector of additional 
parameters. (In many cases we would take $x_1$ to be all nonconstant elements of $x$.) 
Therefore, there is heteroskedasticity in the latent variable model, so that $e$ is no 
longer independent of $x$. The standard deviation of $e$ given $x$ is simply $\exp(x_1\delta)$. 
Define $r \equiv e/\exp(x_1\delta)$, so that $r$ is independent of $x$ with a standard normal 
distribution. Then 

$$\begin{aligned}
P(y = 1 \mid x) &= P(e > -x\beta \mid x) = P[\exp(-x_1\delta)e > -\exp(-x_1\delta)x\beta \mid x] \\
&= P[r > -\exp(-x_1\delta)x\beta \mid x] = \Phi[\exp(-x_1\delta)x\beta].
\end{aligned} \tag{15.24}$$

The partial effects of $x_j$ on $P(y = 1 \mid x)$ are much more complicated in equation 
(15.24) than in equation (15.8). When $\delta = 0$, we obtain the standard probit model. 
Therefore, a test of the probit functional form for the response probability is a test of 
$H_0: \delta = 0$. 

To obtain the LM test of $\delta = 0$ in equation (15.24), it is useful to derive the LM 
test for an index model against a more general alternative. Consider 

$$P(y = 1 \mid x) = m(x\beta, x, \delta), \tag{15.25}$$

where $\delta$ is a $Q \times 1$ vector of parameters. We wish to test $H_0: \delta = \delta_0$, where $\delta_0$ is 
often (but not always) a vector of zeros. We assume that, under the null, we obtain a 
standard index model (probit or logit, usually): 

$$G(x\beta) = m(x\beta, x, \delta_0). \tag{15.26}$$

In the previous example, $G(\cdot) = \Phi(\cdot)$, $\delta_0 = 0$, and $m(x\beta, x, \delta) = \Phi[\exp(-x_1\delta)x\beta]$. 

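Let $\hat\beta$ be the probit or logit estimator of $\beta$ obtained under $\delta = \delta_0$. Define 
$\hat u_i \equiv y_i - G(x_i\hat\beta)$, $\hat G_i \equiv G(x_i\hat\beta)$, and $\hat g_i \equiv g(x_i\hat\beta)$. The gradient of the mean function 
$m(x_i\beta, x_i, \delta)$ with respect to $\beta$, evaluated at $\delta_0$, is simply $g(x_i\beta)x_i$. The only other 
piece we need is the gradient of $m(x_i\beta, x_i, \delta)$ with respect to $\delta$, evaluated at $\delta_0$. Denote 
this $1 \times Q$ vector as $\nabla_\delta m(x_i\beta, x_i, \delta_0)$. Further, set $\nabla_\delta \hat m_i \equiv \nabla_\delta m(x_i\hat\beta, x_i, \delta_0)$. The LM 
statistic can be obtained as the explained sum of squares from the regression 

$$\frac{\hat u_i}{\sqrt{\hat G_i(1 - \hat G_i)}} \ \text{ on } \ \frac{\hat g_i}{\sqrt{\hat G_i(1 - \hat G_i)}}\, x_i,\ \frac{\nabla_\delta \hat m_i}{\sqrt{\hat G_i(1 - \hat G_i)}}, \tag{15.27}$$

which is quite similar to regression (15.22). The null distribution of the LM statistic is 
$\chi^2_Q$, where $Q$ is the dimension of $\delta$. An asymptotically equivalent statistic is $NR_u^2$. 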

When applying this test to the preceding probit example, we have only $\nabla_\delta \hat m_i$ left to 
compute. But $m(x_i\beta, x_i, \delta) = \Phi[\exp(-x_{i1}\delta)x_i\beta]$, and so 

$$\nabla_\delta m(x_i\beta, x_i, \delta) = -(x_i\beta)\exp(-x_{i1}\delta)\, x_{i1}\, \phi[\exp(-x_{i1}\delta)x_i\beta].$$

When evaluated at $\beta = \hat\beta$ and $\delta = 0$ (the null value), we get $\nabla_\delta \hat m_i = -(x_i\hat\beta)\phi(x_i\hat\beta)x_{i1} 
\equiv -(x_i\hat\beta)\hat\phi_i x_{i1}$, a $1 \times K_1$ vector. Regression (15.27) becomes 

$$\frac{\hat u_i}{\sqrt{\hat\Phi_i(1 - \hat\Phi_i)}} \ \text{ on } \ \frac{\hat\phi_i}{\sqrt{\hat\Phi_i(1 - \hat\Phi_i)}}\, x_i,\ \frac{(x_i\hat\beta)\hat\phi_i}{\sqrt{\hat\Phi_i(1 - \hat\Phi_i)}}\, x_{i1}. \tag{15.28}$$

(We drop the minus sign because it does not affect the value of the explained sum of 
squares or $R_u^2$.) Under the null hypothesis that the probit model is correctly specified, 
$LM \overset{a}{\sim} \chi^2_{K_1}$. This statistic is easy to compute after estimation by probit. 

For a one-degree-of-freedom test regardless of the dimension of $x_{i1}$, replace the last 
term in regression (15.28) with $(x_i\hat\beta)^2\hat\phi_i/\sqrt{\hat\Phi_i(1 - \hat\Phi_i)}$, and then the explained sum of 
squares is distributed asymptotically as $\chi^2_1$. See Davidson and MacKinnon (1984) for 
further examples. 

As we discussed briefly in Chapters 12 and 13, variable addition versions of 
specification tests can be somewhat easier to compute. Rather than running the regression 
in equation (15.28), we can instead estimate the auxiliary probit model with response 
probability $\Phi[x_i\beta + (x_i\hat\beta)x_{i1}\delta]$ and test $H_0: \delta = 0$ using a standard Wald test. Again, 
$\hat\beta$ denotes the original probit estimates from probit of $y_i$ on $x_i$. The specification test 
is obtained from probit of $y_i$ on $x_i$, $(x_i\hat\beta)x_{i1}$ and testing the last $Q$ terms for joint 
significance. It is easy to see that the variable addition test (VAT) is asymptotically 
equivalent to the score test, and the VAT circumvents the need to explicitly obtain 
the weighted residuals and gradients. The VAT approach also makes it clear that the 
score test for heteroskedasticity in the error from the latent model is indistinguishable 
from testing for a particular kind of interactive effect in the covariates. We might 
have arrived at such a test directly, without modeling $\mathrm{Var}(e \mid x)$. 


15.6 Reporting the Results for Probit and Logit 


Several statistics should be reported routinely in any probit or logit (or other binary 
choice) analysis. The $\hat\beta_j$, their standard errors, and the value of the likelihood 
function are reported by all software packages that do binary response analysis. The $\hat\beta_j$ 
give the signs of the partial effects of each $x_j$ on the response probability, and the 
statistical significance of $x_j$ is determined by whether we can reject $H_0: \beta_j = 0$. 

One measure of goodness of fit that is sometimes reported is the percent correctly 
predicted. The easiest way to describe this statistic is to define a binary predictor of $y_i$ 
to be one if the predicted probability is at least .5, and zero otherwise. More precisely, 
define the binary variable $\tilde y_i = 1$ if $G(x_i\hat\beta) \geq .5$ and $\tilde y_i = 0$ if $G(x_i\hat\beta) < .5$. Given 
$\{\tilde y_i : i = 1, 2, \ldots, N\}$, we can see how well $\tilde y_i$ predicts $y_i$ across all observations. 
There are four possible outcomes on each pair, $(y_i, \tilde y_i)$; when both are zero or both 
are one, we make the correct prediction. In the two cases where one of the pair is zero 
and the other is one, we make the incorrect prediction. The percent correctly 
predicted is the percent of times that $\tilde y_i = y_i$. (This goodness-of-fit measure can be 
computed for the linear probability model, too.) 

While the percent correctly predicted is useful as a goodness-of-fit measure, it can 
be misleading. In particular, it is possible to get rather high percentages correctly 
predicted even when the least likely outcome is very poorly predicted. For example, 
suppose that N = 200, 160 observations have $y_i = 0$, and, out of these 160 observations,
140 of the $\tilde{y}_i$ are also zero (so we correctly predict 87.5 percent of the zero
outcomes). Even if none of the predictions is correct when $y_i = 1$, we still correctly
predict 70% of all outcomes (140/200 = .70). Often, we hope to be able to predict the
least likely outcome (such as whether someone is arrested for committing a crime),
and we only know how well we do that by obtaining the percent correctly predicted
for each outcome. It is easily shown that the overall percent correctly predicted is the
weighted average of the percent correctly predicted for y = 0 and y = 1, with the
weights being the fraction of zero and one outcomes on y, respectively.
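
The percent correctly predicted, overall and by outcome, can be sketched in a few lines of
Python. This is only an illustration: the function name percent_correct and the fitted
probabilities below are hypothetical, chosen to reproduce the N = 200 illustration in the
text, and the function assumes both outcomes occur in the sample.

```python
def percent_correct(y, p_hat, threshold=0.5):
    """Return (overall, zeros, ones) percent correctly predicted.

    y      : list of 0/1 outcomes
    p_hat  : list of fitted probabilities G(x_i * beta_hat)
    Assumes the sample contains at least one zero and at least one one.
    """
    y_tilde = [1 if p >= threshold else 0 for p in p_hat]
    hits = [int(a == b) for a, b in zip(y, y_tilde)]
    n = len(y)
    n1 = sum(y)
    n0 = n - n1
    overall = 100.0 * sum(hits) / n
    pct0 = 100.0 * sum(h for h, a in zip(hits, y) if a == 0) / n0
    pct1 = 100.0 * sum(h for h, a in zip(hits, y) if a == 1) / n1
    return overall, pct0, pct1


# Hypothetical fitted probabilities matching the N = 200 illustration:
# 160 zeros (140 predicted correctly) and 40 ones (none predicted correctly).
y = [0] * 160 + [1] * 40
p_hat = [0.2] * 140 + [0.7] * 20 + [0.3] * 40
overall, pct0, pct1 = percent_correct(y, p_hat)

# The overall figure is the weighted average of the two outcome-specific figures.
weighted = (160 / 200) * pct0 + (40 / 200) * pct1
```

Reporting pct0 and pct1 alongside the overall figure makes visible the case described
above, where a respectable overall percentage hides a poorly predicted outcome.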

Some have criticized the prediction rule described above for always using a 
threshold value of .5, especially when one of the outcomes is unlikely. For example, 
if $\bar{y} = .08$ (only eight percent “successes” in the sample), it could be that we never
predict $y_i = 1$ because the estimated probability of success is never greater than .5.
One alternative is to use the fraction of successes in the sample as the threshold, .08
in the previous example. In other words, define $\tilde{y}_i = 1$ when $G(x_i\hat{\beta}) \ge .08$, and zero
otherwise. Using this rule will certainly increase the number of predicted successes, 
but not without cost: we necessarily make more mistakes—perhaps many more—in 
predicting the zero outcomes (“failures”). In terms of the overall percent correctly 
predicted, we may actually do worse than when using the traditional .5 threshold. 

A third possibility is to choose the threshold such that the fraction of $\tilde{y}_i = 1$ in the
sample is the same as (or very close to) $\bar{y}$. In other words, search over threshold values
$\tau$, $0 < \tau < 1$, such that if we define $\tilde{y}_i = 1$ when $G(x_i\hat{\beta}) \ge \tau$, then
$\sum_{i=1}^{N} \tilde{y}_i \approx \sum_{i=1}^{N} y_i$. (The trial-and-error effort required to find
the desired value of $\tau$ can be tedious, but it is feasible. In some cases, it will not be
possible to make the number of predicted successes exactly the same as the number of
successes in the sample.) Now, given the set of binary predictors $\tilde{y}_i$, we can compute the
percent correctly predicted for each of the two outcomes, as well as the overall percent
correctly predicted.
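
One way to avoid a manual search, sketched here under the assumption that the fitted
probabilities are available as a list, is to sort them and take the value of the k-th largest
as the threshold, where k is the number of sample successes. This sorting shortcut is a
substitute for the trial-and-error search described above, not the text's own procedure, and
the function name matching_threshold is hypothetical.

```python
def matching_threshold(y, p_hat):
    """Pick tau so that the number of predicted successes is as close as
    possible to the number of observed successes.

    Returns (tau, n_predicted). With ties among the fitted probabilities,
    n_predicted can exceed the number of observed successes.
    """
    k = sum(y)
    if k == 0:
        raise ValueError("no successes in the sample")
    tau = sorted(p_hat, reverse=True)[k - 1]   # k-th largest fitted probability
    n_predicted = sum(1 for p in p_hat if p >= tau)
    return tau, n_predicted


# Hypothetical data: two successes among ten observations.
y = [0, 0, 0, 1, 0, 1, 0, 0, 0, 0]
p_hat = [0.10, 0.20, 0.15, 0.60, 0.30, 0.25, 0.05, 0.12, 0.40, 0.22]
tau, n_predicted = matching_threshold(y, p_hat)

# A tie at the threshold: an exact match is not possible here.
tau_tie, n_tie = matching_threshold([1, 0, 0], [0.3, 0.3, 0.1])
```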

Various pseudo-R-squared measures have been proposed for binary response.
McFadden (1974) suggests the measure $1 - \mathcal{L}_{ur}/\mathcal{L}_0$, where $\mathcal{L}_{ur}$ is the
log-likelihood function for the estimated model and $\mathcal{L}_0$ is the log-likelihood function in
the model with only an intercept. Because the log likelihood for a binary response model is
always negative, $|\mathcal{L}_{ur}| \le |\mathcal{L}_0|$, and so the pseudo-R-squared is always between zero
and one. Alternatively, we can use a sum of squared residuals measure: $1 - SSR_{ur}/SSR_0$,
where $SSR_{ur}$ is the sum of squared residuals $\hat{u}_i = y_i - G(x_i\hat{\beta})$ and $SSR_0$ is the
total sum of squares of $y_i$. Several other measures have been suggested (see, for
example, Maddala, 1983, Chap. 2), but goodness of fit is not as important as statistical
and economic significance of the explanatory variables. Estrella (1998) contains a 
recent comparison of goodness-of-fit measures for binary response. 
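
Both measures are simple to compute from the outcomes and the fitted probabilities. The
sketch below is illustrative only: the function names and the four-observation data set are
hypothetical, and it relies on the fact that in an intercept-only logit or probit the fitted
probability is the sample mean of y. The zero-to-one bound holds for maximum likelihood
fitted values from a model with an intercept, not for arbitrary probabilities.

```python
import math


def log_likelihood(y, p):
    """Bernoulli log likelihood for outcomes y and probabilities p."""
    return sum(math.log(pi) if yi == 1 else math.log(1.0 - pi)
               for yi, pi in zip(y, p))


def mcfadden_r2(y, p_hat):
    """1 - L_ur / L_0, with L_0 from the intercept-only model."""
    ybar = sum(y) / len(y)
    ll_ur = log_likelihood(y, p_hat)
    ll_0 = log_likelihood(y, [ybar] * len(y))
    return 1.0 - ll_ur / ll_0


def ssr_r2(y, p_hat):
    """1 - SSR_ur / SSR_0, with SSR_0 the total sum of squares of y."""
    ybar = sum(y) / len(y)
    ssr_ur = sum((yi - pi) ** 2 for yi, pi in zip(y, p_hat))
    ssr_0 = sum((yi - ybar) ** 2 for yi in y)
    return 1.0 - ssr_ur / ssr_0


# Hypothetical outcomes and fitted probabilities.
y = [0, 0, 1, 1]
p_hat = [0.2, 0.3, 0.7, 0.8]
r2_mcfadden = mcfadden_r2(y, p_hat)
r2_ssr = ssr_r2(y, p_hat)
```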

Usually we want to estimate the effects of the variables $x_j$ on the response
probabilities P(y = 1|x). If $x_j$ is (roughly) continuous, then


$\Delta \hat{P}(y = 1 \mid x) \approx [g(x\hat{\beta})\hat{\beta}_j]\,\Delta x_j$    (15.29)


for small changes in $x_j$. (As usual when using calculus, the notion of “small” here is
somewhat vague.) Therefore, the estimated partial effect of a continuous variable on
the response probability, evaluated at x, is $g(x\hat{\beta})\hat{\beta}_j$. Because $g(x\hat{\beta})$ depends
on x, we need to decide which partial effects to report. We could report this scale factor at,
say, medians of the explanatory variables, or at different quantiles. For summarizing
the magnitudes of the effects, it is useful to have a single scale factor that can be used
to multiply the coefficients on (roughly) continuous variables. Often the sample
averages of the $x_j$ are plugged in to get $g(\bar{x}\hat{\beta})$, with $x_1 = 1$ because we include a
constant. We call the resulting partial effect the partial effect at the average (PEA).

The PEA is easy to compute, but it does have drawbacks. First, it need not repre- 
sent the partial effect for any particular unit in the population. That is, the average 
of the explanatory variables may not, in any sensible way, represent the average unit 
in the population. One issue is that, if x contains nonlinear functions of underlying 
variables, such as logarithms, we must decide whether to use the average of the 
nonlinear function or the nonlinear function of the average. The latter has some 
appeal (for example, if log(inc) is an explanatory variable, use $\log(\overline{inc})$
rather than $\overline{\log(inc)}$), but software packages that support PEA estimation (such
as Stata, with its mfx, for “marginal effects,” command) necessarily use the average 
of the nonlinear functions (because one must create the nonlinear functions before 
including them in logit or probit). 

If two or more elements of x are functionally related, such as quadratics or
interactions, it is not even clear what the PEAs of individual coefficients mean. For
example, suppose $x_{K-1} = age$ and $x_K = age^2$. Then the reported PEAs for age and
$age^2$ are $g(\bar{x}\hat{\beta})\hat{\beta}_{K-1}$ and $g(\bar{x}\hat{\beta})\hat{\beta}_K$, respectively, where
$\bar{x} = (1, \bar{x}_2, \ldots, \bar{x}_{K-2}, \overline{age}, \overline{age^2})$.
These PEAs do not tell us what we want to know about the partial effect of age on
P(y = 1|x). For any x, the estimated partial effect is
$g(x\hat{\beta})(\hat{\beta}_{K-1} + 2\hat{\beta}_K\,age)$. Now, we might be interested in evaluating this
partial effect at the mean values, but that would entail using $\overline{age}^2$, rather than
$\overline{age^2}$, inside g(·). If we are really interested in the effect of age on the response
probability, we might want to evaluate the partial effect at several different values of age,
perhaps evaluating the other explanatory variables at their means. If $\hat{\beta}_{K-1}$ and
$\hat{\beta}_K$ have different signs, the estimated turning point in the relationship is just as in
models linear in the parameters: $age^* = |\hat{\beta}_{K-1}/(2\hat{\beta}_K)|$. The
usual turning point calculation works because $g(x\hat{\beta}) > 0$ for all x. Similar care must
be used for obtaining partial effects when interactions are included in x.
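
A small probit sketch of the quadratic case may help. All numbers here are hypothetical
(coefficients .10 on age and -.00125 on age squared, with the contribution of the other
regressors fixed at -1.5); they are not estimates from the text. The point is only that the
partial effect combines both coefficients and changes sign at the turning point.

```python
from statistics import NormalDist

phi = NormalDist().pdf  # standard normal density, g(.) for probit


def age_partial_effect(index_other, b_age, b_age2, age):
    """g(x*beta) * (b_age + 2*b_age2*age), with the other regressors'
    contribution to the index held fixed at index_other."""
    index = index_other + b_age * age + b_age2 * age ** 2
    return phi(index) * (b_age + 2.0 * b_age2 * age)


def turning_point(b_age, b_age2):
    """Age at which the partial effect changes sign: |b_age / (2*b_age2)|."""
    return abs(b_age / (2.0 * b_age2))


# Hypothetical coefficient values, for illustration only.
B_AGE, B_AGE2, INDEX_OTHER = 0.10, -0.00125, -1.5
age_star = turning_point(B_AGE, B_AGE2)
effects = {age: age_partial_effect(INDEX_OTHER, B_AGE, B_AGE2, age)
           for age in (30, 40, 50)}
```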

For discrete variables, it is well known that the average need not even be a possible 
outcome of the variable. If, say, $x_2$ is a binary variable, $\bar{x}_2$ is the fraction of ones in
the sample, and therefore cannot correspond to a particular unit (which must have
zero or one). For example, if $x_2 = female$ is a gender dummy, then the PEA is the
partial effect when female is replaced with the fraction of women in the sample. One
way to overcome this conceptual problem is to compute partial effects separately for
$x_2 = 1$ and $x_2 = 0$, but then we no longer have a single number to report as the
partial effect. Similar comments hold for other discrete variables, such as number of
children. If the average is 1.5, plugging this value into g(-) does not produce the 
partial effect for any particular family. 

To obtain standard errors of the partial effects in equation (15.29) we can use the
delta method. Consider the case $j = K$ for notational simplicity, and for given x,
define $\hat{\delta}_K = \hat{\beta}_K g(x\hat{\beta}) = \partial\hat{P}(y = 1 \mid x)/\partial x_K$. Write this relation as
$\hat{\delta}_K = h(\hat{\beta})$ to denote that this is a (nonlinear) function of the vector $\hat{\beta}$. We
assume $x_1 = 1$. The gradient of $h(\beta)$ is


$\nabla_\beta h(\beta) = \left[ \beta_K \frac{dg}{dz}(x\beta),\ \beta_K x_2 \frac{dg}{dz}(x\beta),\ \ldots,\ \beta_K x_{K-1} \frac{dg}{dz}(x\beta),\ \beta_K x_K \frac{dg}{dz}(x\beta) + g(x\beta) \right]$,


where dg/dz is simply the derivative of g with respect to its argument. The delta
method implies that the asymptotic variance of $\hat{\delta}_K$ is estimated as


$[\nabla_\beta h(\hat{\beta})]\,\hat{V}\,[\nabla_\beta h(\hat{\beta})]'$,    (15.30)


where $\hat{V}$ is the asymptotic variance estimate of $\hat{\beta}$. The asymptotic standard error of
$\hat{\delta}_K$ is simply the square root of expression (15.30). This calculation allows us to
obtain a large-sample confidence interval for $\delta_K$. The program Stata does this
calculation for logit and probit using the mfx command. Alternatively, we can apply the
bootstrap as discussed in Chapter 12.

If $x_K$ is a discrete variable, then we can estimate the change in the predicted
probability in going from $c_K$ to $c_K + 1$ as


$\hat{\delta}_K = G[\hat{\beta}_1 + \hat{\beta}_2\bar{x}_2 + \cdots + \hat{\beta}_{K-1}\bar{x}_{K-1} + \hat{\beta}_K(c_K + 1)]$
$\qquad - G(\hat{\beta}_1 + \hat{\beta}_2\bar{x}_2 + \cdots + \hat{\beta}_{K-1}\bar{x}_{K-1} + \hat{\beta}_K c_K)$.    (15.31)

In particular, when $x_K$ is a binary variable, set $c_K = 0$. Of course, the other $x_j$'s can
be evaluated anywhere, but the use of sample averages is typical. The delta method
can be used to obtain a standard error of equation (15.31). For probit, Stata does this
calculation when $x_K$ is a binary variable. Usually the calculations ignore the fact that
$\bar{x}_j$ is an estimate of $E(x_j)$ in applying the delta method. If we are truly interested in
$\beta_K g(\mu_x\beta)$, the estimation error in $\bar{x}$ can be accounted for, but it makes the calculation
more complicated, and it is unlikely to have a large effect. The bootstrap would
properly account for the sampling variation in $\bar{x}$ when $\bar{x}$ is recomputed for each
bootstrap replication.

An alternative way to summarize the estimated marginal effects is to estimate the
average value of $\beta_K g(x\beta)$ across the population, or $\beta_K E[g(x\beta)]$. This quantity is the
average partial effect (APE) that we discussed generally in Chapter 2, and also for
linear models with random coefficients in Chapter 4. Here we are averaging across
the distribution of all observable covariates. A consistent estimator of the APE is


$\hat{\beta}_K \left[ N^{-1} \sum_{i=1}^{N} g(x_i\hat{\beta}) \right]$    (15.32)

when $x_K$ is continuous or


$N^{-1} \sum_{i=1}^{N} [G(\hat{\beta}_1 + \hat{\beta}_2 x_{i2} + \cdots + \hat{\beta}_{K-1}x_{i,K-1} + \hat{\beta}_K) - G(\hat{\beta}_1 + \hat{\beta}_2 x_{i2} + \cdots + \hat{\beta}_{K-1}x_{i,K-1})]$    (15.33)


when $x_K$ is binary. Naturally, we can use equation (15.33) to estimate the APE from
changing $x_K$ (continuous or discrete) from one value to another, say $c_K^0$ to $c_K^1$. Then
$\hat{\beta}_K$ is replaced with $\hat{\beta}_K c_K^1$ in the first term and we insert $\hat{\beta}_K c_K^0$ into the second
occurrence of G(·). Further, in either equation (15.32) or (15.33), we can fix some of the
explanatory variables at specific values and average across the remaining ones. For example,
suppose that in a logit or probit model of employment, $x_K$ is a job training binary
indicator and $x_{K-1}$ is a race indicator, say, one for nonwhite, zero for white. Then,
rather than compute (15.33), we can set $x_{K-1} = 1$ and compute the APE for
nonwhites, and then set $x_{K-1} = 0$ to obtain the APE for whites.

As in the case of the PEA, if some elements of x are functions of each other,
obtaining APEs of the form in equation (15.32) is not especially useful. If, say,
$x_{K-1} = age$ and $x_K = age^2$, we can estimate the APE of age by averaging the
individual partial effects, $(\hat{\beta}_{K-1} + 2\hat{\beta}_K age_i) \cdot g(x_i\hat{\beta})$, across i. Again, it probably
makes more sense to evaluate the partial effect at different values of age and then to average
these across the other variables, say,
$N^{-1} \sum_{i=1}^{N} (\hat{\beta}_{K-1} + 2\hat{\beta}_K age^o) \cdot g(\hat{\beta}_1 + \hat{\beta}_2 x_{i2} + \cdots + \hat{\beta}_{K-2}x_{i,K-2} + \hat{\beta}_{K-1}age^o + \hat{\beta}_K(age^o)^2)$
for a given value $age^o$.




Generally, we should not expect the scale factors for the PEAs and APEs to be
similar. That is, we should not expect $g(\bar{x}\hat{\beta})$ to be similar to
$N^{-1} \sum_{i=1}^{N} g(x_i\hat{\beta})$. The reason is simple: the average of a nonlinear function
(the APE scale) is not the nonlinear function of the average (the PEA scale). In the
population the scales are not the same either, because the expected value does not pass
through nonlinear functions: $g[E(x\beta)] \ne E[g(x\beta)]$.
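
A tiny numerical illustration of the gap between the two scale factors, using probit and a
hypothetical five-point regressor with slope 1.5 (none of these numbers come from the
text):

```python
from statistics import NormalDist, fmean

phi = NormalDist().pdf  # standard normal density


def pea_scale(beta, X):
    """g evaluated at the sample averages of the regressors."""
    xbar = [fmean(col) for col in zip(*X)]
    return phi(sum(b * x for b, x in zip(beta, xbar)))


def ape_scale(beta, X):
    """Sample average of g(x_i * beta)."""
    return fmean(phi(sum(b * x for b, x in zip(beta, row))) for row in X)


# Hypothetical data: a constant and one regressor spread symmetrically about zero.
X = [[1.0, x] for x in (-2.0, -1.0, 0.0, 1.0, 2.0)]
beta_hat = [0.0, 1.5]
pea = pea_scale(beta_hat, X)
ape = ape_scale(beta_hat, X)
```

Here the average index is zero, where the normal density peaks, so the PEA scale is about
three times the APE scale. In other designs the ordering can differ; only the inequality of
the two is general.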

Equation (15.33) has a nice interpretation for policy analysis, where $x_K$ is the
binary policy indicator (equal to one when the policy is in effect, zero otherwise). We
can view each summand in (15.33) as follows. Regardless of whether unit i was sub- 
ject to the policy, we can obtain the predicted probability in each regime. The term in 
brackets in equation (15.33) is the difference in the estimated probability that y; = 1 
with and without participation. In other words, we compute the (counterfactual) 
effect of the policy for each cross section unit i, and then average these differences 
across all i. This gives us an estimate of the average effect of the policy. We will have
much more to say about a counterfactual framework for policy analysis in Chapter 
21 and the estimation of average treatment effects, of which equation (15.33) is an 
example. 

We can obtain standard errors of the estimators in (15.32) and (15.33) by applying 
the bootstrap to an average, as described in Section 12.8.2, or we can use the delta 
method; see Problem 15.15. 

Unlike the magnitudes of the coefficient estimates, the APEs (and, to a lesser ex- 
tent, the PEAs) can be compared across models. It is rare to find, say, probit and 
logit estimates having different signs unless the coefficients are estimated imprecisely. 
But logit, probit, and the LPM implicitly use different scale factors. The scale for the 
LPM is unity. For logit and probit, we can look at the functions g(·) at zero to obtain
a rough idea of how the $\hat{\beta}_j$ (at least the slopes on the continuous variables) may
differ. For probit, $g(0) \approx .4$ and for logit $g(0) = .25$. Therefore, we expect the logit
slope coefficients to have the largest magnitude, followed by the probit estimates,
followed by the LPM estimates. In fact, sometimes some (crude) rules of thumb are
adopted: multiply the probit coefficients by 1.6 and the LPM coefficients by 4 to
make them roughly comparable to the logit estimates. The LPM estimates are
multiplied by 2.5 to make them comparable to the probit estimates. If $\bar{x}\hat{\beta}$ is close to zero,
then $g(\bar{x}\hat{\beta})$ will be close to .25 for logit and close to .4 for probit, and then the rules of
thumb for adjusting coefficients can be justified based on the PEAs. (When one of the
outcomes on y is much less likely than the other, it is unlikely that $\bar{x}\hat{\beta}$ will be close to
zero.) Generally, it is a good idea to compute the PEAs and APEs, as well as partial
effects at other interesting values of x, or at values that allow us to determine policy 
effects for different groups in the population. 
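
The rule-of-thumb multipliers are just ratios of the densities at zero, which a few lines of
Python confirm (the variable names are ours):

```python
import math

# Densities evaluated at zero.
probit_g0 = 1.0 / math.sqrt(2.0 * math.pi)                # standard normal
logit_g0 = math.exp(0.0) / (1.0 + math.exp(0.0)) ** 2     # logistic
lpm_g0 = 1.0                                              # LPM scale is unity

probit_to_logit = probit_g0 / logit_g0   # multiply probit coefficients by this
lpm_to_logit = lpm_g0 / logit_g0         # multiply LPM coefficients by this
lpm_to_probit = lpm_g0 / probit_g0       # multiply LPM coefficients by this
```

The three ratios round to the 1.6, 4, and 2.5 quoted above.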
As a general rule, the estimates from an LPM estimation are more comparable to
the APEs than to the PEAs. As mentioned above, the PEAs need not come close to
representing an individual in the population. On the other hand, the APE is the
partial effect averaged across the population and, in one case, we can show that the LPM
estimates are consistent for the APEs regardless of the actual function G(·). In fact,
using a result of Stoker (1986), we can say much more, for any kind of response
variable y. Let $m(x) = E(y \mid x)$ be the conditional mean, so that $\delta = E[\nabla_x m(x)']$ is
the vector of APEs, where now x just includes continuous covariates and no
elements are functionally related. Under assumptions about the distribution of x,
which are generally satisfied if x has convex, unbounded support, Stoker shows that
$E[\nabla_x m(x)'] = -E[m(x) \cdot \nabla_x \log f(x)'] = -E[y \cdot \nabla_x \log f(x)']$ (where the second
equality follows by iterated expectations), and so the vector of APEs is simply
$-E[y \cdot \nabla_x \log f(x)']$. Now, if $x \sim \mathrm{Normal}(\mu_x, \Sigma_x)$, that is, if x has a multivariate
normal distribution, then it is easily shown that $\nabla_x \log f(x) = -(x - \mu_x)\Sigma_x^{-1}$. It
follows by substitution that


$\delta = E[y \cdot \Sigma_x^{-1}(x - \mu_x)'] = \{E[(x - \mu_x)'(x - \mu_x)]\}^{-1} E[(x - \mu_x)'y]$,


which is simply the vector of slope coefficients from the linear projection of y on 1, x. 
This calculation demonstrates that, regardless of the conditional mean function m(x), 
if x has a multivariate normal distribution, then a linear regression consistently esti- 
mates the APEs—or, in Stoker’s (1986) terminology, the average derivatives. 
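
A small simulation can illustrate the result in the simplest case. The sketch below assumes
a single standard normal regressor and a probit response with hypothetical parameters
(intercept .3, slope .8); it compares the OLS slope with the sample analogue of the average
partial effect. The function name and seed are arbitrary.

```python
import random
from statistics import NormalDist, fmean

_std = NormalDist()


def simulate(alpha=0.3, beta=0.8, n=100_000, seed=12345):
    """Return (OLS slope of y on x, estimated APE) for a probit response
    with a single standard normal regressor."""
    rng = random.Random(seed)
    xs = [rng.gauss(0.0, 1.0) for _ in range(n)]
    ys = [1.0 if alpha + beta * x + rng.gauss(0.0, 1.0) > 0 else 0.0
          for x in xs]
    xbar, ybar = fmean(xs), fmean(ys)
    sxy = sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys))
    sxx = sum((x - xbar) ** 2 for x in xs)
    ols_slope = sxy / sxx
    ape = beta * fmean(_std.pdf(alpha + beta * x) for x in xs)
    return ols_slope, ape


ols_slope, ape = simulate()
```

With these settings the two numbers should agree up to sampling error, whereas the probit
slope itself (.8) is several times larger than either.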

The general result applies to binary responses. In fact, there is no reason to even
specify the model in the index form $P(y = 1 \mid x) = G(\alpha + x\beta)$. But, in this case, for
continuously differentiable functions G(·), the vector of average partial effects is
$E[g(\alpha + x\beta)]\beta$, and we can conclude that the linear regression slopes estimate $\beta$ up to
the positive scale factor $E[g(\alpha + x\beta)]$. We discuss results of this sort further in
Section 15.7.6.

The result we just derived is of limited practical value because the assumption that 
x is multivariate normal is very restrictive. In particular, it rules out discrete variables 
and continuous variables with asymmetric distributions. In the context of the index 
model $P(y = 1 \mid x) = G(\alpha + x\beta)$, assuming that all elements of x are normally
distributed is very restrictive. (For example, it rules out some staples of empirical work,
such as quadratics and interactions, in addition to discrete explanatory variables.) 
Nevertheless, the result does show that estimation of an LPM can consistently esti- 
mate average partial effects, and it may be a good estimator for moderate deviations 
of x from normality. 


Example 15.2 (Married Women’s Labor Force Participation): We now estimate
logit and probit models for women’s labor force participation. For comparison, we
report the linear probability estimates. The results, with standard errors in
parentheses, are given in Table 15.1 (for the LPM, these are heteroskedasticity-robust).


Table 15.1
LPM, Logit, and Probit Estimates of Labor Force Participation

Dependent Variable: inlf

                                   LPM          Logit         Probit
Independent Variable              (OLS)         (MLE)         (MLE)
nwifeinc                         -.0034         -.021         -.012
                                 (.0015)       (.008)        (.005)
educ                              .038          .221          .131
                                 (.007)        (.043)        (.025)
exper                             .039          .206          .123
                                 (.006)        (.032)        (.019)
exper^2                          -.00060       -.0032        -.0019
                                 (.00019)      (.0010)       (.0006)
age                              -.016         -.088         -.053
                                 (.002)        (.015)        (.008)
kidslt6                          -.262        -1.443         -.868
                                 (.032)        (.204)        (.119)
kidsge6                           .013          .060          .036
                                 (.013)        (.075)        (.043)
constant                          .586          .425          .270
                                 (.151)        (.860)        (.509)
Number of observations            753           753           753
Percent correctly predicted       73.4          73.6          73.4
Log-likelihood value               -          -401.77       -401.30
Pseudo-R-squared                  .264          .220          .221

The estimates from the three models tell a consistent story. The signs of the co- 
efficients are the same across models, and the same variables are statistically signifi- 
cant in each model. The pseudo-R-squared for the LPM is just the usual R-squared 
reported for OLS; for logit and probit the pseudo-R-squared is the measure based on 
the log likelihoods described previously. In terms of overall percent correctly
predicted, the models do equally well. The probit model correctly predicts “out of
the labor force” about 63.1 percent of the time, and it correctly predicts “in the labor
force” about 81.3 percent of the time. The LPM has the same overall percent
correctly predicted, but there are slight differences within each outcome.

As we emphasized earlier, the magnitudes of the coefficients are not directly
comparable across the models, although the ratios of coefficients on the (roughly)
continuous explanatory variables are. For example, the ratios of the nwifeinc and educ
coefficients are about -.0895, -.0950, and -.0916 for the LPM, logit, and probit
models, respectively.

If we evaluate the standard normal probability density function,
$\phi(\hat{\beta}_0 + \hat{\beta}_1 x_1 + \cdots + \hat{\beta}_k x_k)$, at the average values of the independent
variables in the sample (including the average of $exper^2$), we obtain about .391. (This
value is close to .4 because the outcomes on inlf are fairly balanced: about 56.7% of the
women report being in the labor force. When one outcome is much more likely than the
other, the scale factor tends to be much smaller.) We can multiply the coefficients on the
roughly continuous variables (and maybe even the variables measuring the number of
children) by this scale factor to get an estimate of the effect of a one-unit increase on the
response probability, starting from the mean values. The scale factor for computing the
PEAs for logit is about .243, which is close to .25, the value of the logistic density at zero.
If we use these scale factors to compute the PEA for, say, nwifeinc, we get about -.0051
for logit and about -.0047 for probit. Both are somewhat larger in magnitude than the LPM
effect, which is -.0034.

The scale factors for the APEs are smaller. For probit, the average of
$\phi(\hat{\beta}_0 + \hat{\beta}_1 x_{i1} + \cdots + \hat{\beta}_k x_{ik})$ across i is about .301, and for logit it is
.179. When we multiply these scale factors by the nwifeinc coefficients we get -.0038 for
logit and -.0036 for probit. A bootstrap standard error for the probit estimate, using 500
bootstrap samples, is about .0016. As expected, the APEs are much closer to the
LPM effect, and the bootstrap standard error of the APE is very close to the LPM
standard error for the nwifeinc coefficient, .0015. The same is true of the APEs for the
other explanatory variables. Even for a discrete variable such as kidslt6, for which
over 96% of the sample takes on zero or one, the APE for the logit model is about
-.258, and for the probit it is -.261 (bootstrap standard error = .033). The partial
effect in the LPM is -.262 (standard error = .032).

Potentially the biggest difference between the LPM, on the one hand, and
the logit and probit models on the other is that the LPM implies constant marginal
effects for educ, kidslt6, and so on, while the logit and probit models allow for a
diminishing effect, for continuous and discrete variables alike. For example, in the LPM,
one more young child, whether going from zero to one or from one to two, is
estimated to reduce the labor force participation probability of a woman by .262,
independent of the other income, education, age, and experience of the woman. We just saw
that this is a good estimate of the average effect in the sense that it is similar to the
estimates for logit and probit (when kidslt6 is treated as a continuous variable). But the
effect might differ across the population, and certainly the effect of having the first young
child and the second young child might be different. To get an idea of how much
a partial effect might differ from the APE, take a woman with nwifeinc = 20.13,
educ = 12.3, exper = 10.6, age = 42.5 (which are roughly the sample averages)
and kidsge6 = 1. In the probit model, what is the estimated fall in the probability
of being in the labor force when going from zero to one small child? We evaluate
the standard normal cdf, $\Phi(\hat{\beta}_0 + \hat{\beta}_1 x_1 + \cdots + \hat{\beta}_k x_k)$, at kidslt6 = 1 and
kidslt6 = 0, with the other explanatory variables set at the values just given. We get,
roughly, .373 - .707 = -.334, which means a .334 drop in the probability of being in the
labor force. (The scaled coefficient for the PEA is about -.347, and so it is not much
different.) This estimated effect is substantially larger than the constant effect obtained
from the LPM. If the woman goes from one young child to two, the probability falls
even more, but the marginal effect is not as large: .117 - .373 = -.256.
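
The calculation just described can be reproduced approximately from the rounded probit
coefficients in Table 15.1. Because the table reports only a few digits, the probabilities
from this sketch differ slightly from the .707, .373, and .117 quoted in the text, which are
based on unrounded estimates. The dictionary and function names are ours.

```python
from statistics import NormalDist

Phi = NormalDist().cdf  # standard normal cdf

# Rounded probit estimates from Table 15.1.
PROBIT = {
    "constant": 0.270, "nwifeinc": -0.012, "educ": 0.131, "exper": 0.123,
    "expersq": -0.0019, "age": -0.053, "kidslt6": -0.868, "kidsge6": 0.036,
}


def prob_in_labor_force(kidslt6, nwifeinc=20.13, educ=12.3, exper=10.6,
                        age=42.5, kidsge6=1):
    """Predicted probit probability at the covariate values used in the text."""
    b = PROBIT
    index = (b["constant"] + b["nwifeinc"] * nwifeinc + b["educ"] * educ
             + b["exper"] * exper + b["expersq"] * exper ** 2
             + b["age"] * age + b["kidslt6"] * kidslt6
             + b["kidsge6"] * kidsge6)
    return Phi(index)


p0, p1, p2 = (prob_in_labor_force(k) for k in (0, 1, 2))
first_child = p1 - p0    # change in probability, zero to one young child
second_child = p2 - p1   # change in probability, one to two young children
```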

If we compute the difference in predicted probabilities for each woman at one and
zero young children and then average these (that is, if we compute the APE in going
from zero to one young child), the estimate is about -.272, while the APE in going
from one to two is about -.220. The LPM estimate is close to the first estimate,
which makes sense when interpreting the LPM as estimates of the average partial
effect: less than 4 percent of women have more than one young child, and so the
estimated effect of moving from one to two (and certainly from two to three)
contributes very little to the APE.


Binary response models apply with little modification to independently pooled 
cross sections or to other data sets where the observations are independent but not 
necessarily identically distributed. Often year or other time-period dummy variables 
are included to account for aggregate time effects. Just as with linear models, probit 
can be used to evaluate the impact of certain policies in the context of a natural ex- 
periment; see Problem 15.13. An application is given in Gruber and Poterba (1994). 


15.7 Specification Issues in Binary Response Models 


We now turn to several issues that can arise in applying binary response models to 
economic data. All of these topics are relevant for general index models, but features 
of the normal distribution allow us to obtain concrete results in the context of probit 
models. Therefore, our primary focus is on probit models. 


15.7.1 Neglected Heterogeneity 


We begin by studying the consequences of omitting variables when those omitted 
variables are independent of the included explanatory variables. This is also called the 
neglected heterogeneity problem. The (structural) model of interest is 


$P(y = 1 \mid x, c) = \Phi(x\beta + \gamma c)$,    (15.34)

where x is $1 \times K$ with $x_1 = 1$ and c is a scalar. We are interested in the partial effects
of the $x_j$ on the probability of success, holding c (and the other elements of x) fixed.
We can write equation (15.34) in latent variable form as $y^* = x\beta + \gamma c + e$, where
$y = 1[y^* > 0]$ and $e \mid x, c \sim \mathrm{Normal}(0, 1)$. Because $x_1 = 1$, E(c) = 0 without loss of
generality.

Now suppose that c is independent of x and $c \sim \mathrm{Normal}(0, \tau^2)$. (Remember, this
assumption is much stronger than Cov(x, c) = 0 or even E(c|x) = 0: under
independence, the distribution of c given x does not depend on x.) Given these assumptions,
the composite term, $\gamma c + e$, is independent of x and has a $\mathrm{Normal}(0, \gamma^2\tau^2 + 1)$
distribution. Therefore,


$P(y = 1 \mid x) = P(\gamma c + e > -x\beta \mid x) = \Phi(x\beta/\sigma)$,    (15.35)


where $\sigma^2 \equiv \gamma^2\tau^2 + 1$. It follows immediately from equation (15.35) that probit of y
on x consistently estimates $\beta/\sigma$. In other words, if $\hat{\beta}$ is the estimator from a probit of
y on x, then plim $\hat{\beta}_j = \beta_j/\sigma$. Because $\sigma = (\gamma^2\tau^2 + 1)^{1/2} > 1$ (unless $\gamma = 0$ or
$\tau^2 = 0$), $|\beta_j/\sigma| < |\beta_j|$.

The attenuation bias in estimating $\beta_j$ in the presence of neglected heterogeneity has
prompted statements of the following kind: “In probit analysis, neglected
heterogeneity is a much more serious problem than in linear models because, even if the
omitted heterogeneity is independent of x, the probit coefficients are inconsistent.”
We just derived that probit of y on x consistently estimates $\beta/\sigma$ rather than $\beta$, so
the statement is technically correct. However, we should remember that, in nonlinear
models, we usually want to estimate partial effects and not just parameters. For the
purposes of obtaining the directions of the effects or the relative effects of the
continuous explanatory variables, estimating $\beta/\sigma$ is just as good as estimating $\beta$.

To be more precise, the scaled coefficient, $\beta_j/\sigma$, has the same sign as $\beta_j$, and so we
will correctly (with enough data) determine the direction of the partial effect of any
variable (discrete, continuous, or some mixture) by estimating the scaled
coefficients. Further, the ratio of any two scaled coefficients, $\beta_j/\sigma$ and $\beta_h/\sigma$, is simply
$\beta_j/\beta_h$. For a continuous $x_j$, the partial effect in model (15.34), which we might call
the “structural partial effect,” is


$\partial P(y = 1 \mid x, c)/\partial x_j = \beta_j \phi(x\beta + \gamma c)$.    (15.36)


Therefore, the ratio of partial effects for two continuous variables $x_j$ and $x_h$ is simply
$\beta_j/\beta_h$, the same as the ratio of scaled coefficients.

Is there any quantity of interest we cannot estimate by not being able to estimate
$\beta$? Yes, although its importance is debatable. Because c is normalized so that
E(c) = 0, we might be interested in the partial effect in (15.36) evaluated at c = 0,
which is simply $\beta_j\phi(x\beta)$. It is clear that we would need to consistently estimate $\beta$ in
order to estimate the partial effect at the mean value of heterogeneity (and any value
of the covariates). What we consistently estimate from the probit of y on x is


$(\beta_j/\sigma)\phi(x\beta/\sigma)$.    (15.37)


This expression shows that, if we are interested in the partial effects evaluated at
c = 0, then probit of y on x does not do the trick. An interesting fact about
expression (15.37) is that, even though $\beta_j/\sigma$ is closer to zero than $\beta_j$, $\phi(x\beta/\sigma)$ is larger
than $\phi(x\beta)$ because $\phi(z)$ increases as $|z| \to 0$, and $\sigma > 1$. Therefore, for estimating the
partial effects in equation (15.36) at c = 0, it is not clear for what values of x an
attenuation bias exists. Plugging other values of c into equation (15.36) makes little
sense, as we cannot identify $\gamma$, and even if we could, we generally know nothing about
the distribution of c other than its mean is zero. Indeed, c is typically meant to
capture such nebulous concepts as “ability,” “health,” or “taste for saving,” and without
further information we have no hope of estimating the distribution of these
unobservable attributes.
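
To see why the direction of the bias is ambiguous, it helps to compare expression (15.37)
with the partial effect at c = 0 for several values of the index. The sketch below is a
hypothetical illustration with $\beta_j = 1$ and $\sigma = \sqrt{2}$; the crossover formula
follows from equating the two expressions and solving for $|x\beta|$, and is our own
algebra rather than a result stated in the text.

```python
import math
from statistics import NormalDist

phi = NormalDist().pdf  # standard normal density


def structural_pe_at_c0(beta_j, index):
    """beta_j * phi(x*beta): the partial effect (15.36) evaluated at c = 0."""
    return beta_j * phi(index)


def estimated_pe(beta_j, index, sigma):
    """(beta_j / sigma) * phi(x*beta / sigma): expression (15.37)."""
    return (beta_j / sigma) * phi(index / sigma)


def crossover_index(sigma):
    """|x*beta| at which the two expressions coincide (requires sigma > 1).

    Setting (1/sigma)*phi(z/sigma) = phi(z) and taking logs gives
    z^2 = 2*log(sigma) / (1 - 1/sigma^2).
    """
    return math.sqrt(2.0 * math.log(sigma) / (1.0 - 1.0 / sigma ** 2))


SIGMA = math.sqrt(2.0)        # hypothetical: gamma^2 * tau^2 = 1
z_star = crossover_index(SIGMA)
gap_at_zero = estimated_pe(1.0, 0.0, SIGMA) - structural_pe_at_c0(1.0, 0.0)
gap_at_three = estimated_pe(1.0, 3.0, SIGMA) - structural_pe_at_c0(1.0, 3.0)
```

For indexes near zero the estimated effect understates the effect at c = 0, while for
indexes beyond the crossover it overstates it.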

With c having a normal distribution in the population, the partial effect evaluated
at c = 0 describes only a small fraction of the population. (Technically, P(c = 0) = 0.)
Instead, we can estimate the average partial effect (APE), where we now average out
the unobserved heterogeneity and are left with a function of x. In particular, the APE
is obtained, for given x, by averaging equation (15.36) across the distribution of c in
the population. For emphasis, let $x^o$ be a given value of the explanatory variables
(which could be, but need not be, the mean value). When we plug $x^o$ into equation
(15.36) and take the expected value with respect to the distribution of c, we get


$E[\beta_j \phi(x^o\beta + \gamma c)] = (\beta_j/\sigma)\phi(x^o\beta/\sigma)$.    (15.38)


In other words, probit of y on x consistently estimates the average partial effects,
which is usually what we want.

The result in equation (15.38) follows from the general treatment of average partial
effects in Section 2.2.5. In the current setup, there are no extra conditioning variables,
w, and the unobserved heterogeneity is independent of x. It follows from equation
(2.35) that the APE with respect to $x_j$, evaluated at $x^o$, is simply
$\partial E(y \mid x^o)/\partial x_j$. But from the law of iterated expectations,
$E(y \mid x) = E_c[\Phi(x\beta + \gamma c)] = \Phi(x\beta/\sigma)$, where $E_c(\cdot)$ denotes the expectation
with respect to the distribution of c. The derivative of $\Phi(x\beta/\sigma)$ with respect to $x_j$ is
$(\beta_j/\sigma)\phi(x\beta/\sigma)$, which is what we wanted to show.

The bottom line is that omitted heterogeneity in probit models is not a problem 
when it is independent of x: ignoring it preserves the signs of all partial effects, gives 
the same relative effects for continuous explanatory variables, and provides
consistent estimates of the average partial effects. Of course, the previous arguments hinge
on the normality of c and the probit structural equation. If the structural model 
(15.34) were, say, logit and if c were normally distributed, we would not get a probit 
or logit for the distribution of y given x; the response probability is more compli- 
cated. The lesson from Section 2.2.5 is that we might as well work directly with 
models for P(y = 1|x) because partial effects of P(y = 1|x) are always the average 
of the partial effects of P(y = 1 |x, c) over the distribution of c. 

If c is correlated with x or is otherwise dependent on x (for example, if Var(c|x)
depends on x), then omission of c is serious. In this case we cannot get consistent
estimates of the average partial effects. For example, if $c \mid x \sim \mathrm{Normal}(x\delta, \eta^2)$, then
probit of y on x gives consistent estimates of $(\beta + \gamma\delta)/\rho$, where $\rho^2 = \gamma^2\eta^2 + 1$.
Unless $\gamma = 0$ or $\delta = 0$, we do not consistently estimate $\beta/\sigma$. This result is not surprising
given what we know from the linear case with omitted variables correlated with the
$x_j$. We now study what can be done to account for endogenous variables in probit
models.
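
The limit $(\beta + \gamma\delta)/\rho$ can be verified in one step. Write $c = x\delta + a$, where
$a \mid x \sim \mathrm{Normal}(0, \eta^2)$; the symbol a is introduced here only for this derivation.
Because e is independent of (x, c), it is also independent of a, and substituting into the
latent variable equation gives:

```latex
\begin{aligned}
y^* &= x\beta + \gamma c + e = x(\beta + \gamma\delta) + (\gamma a + e), \\
\gamma a + e \mid x &\sim \mathrm{Normal}(0,\ \gamma^2\eta^2 + 1) = \mathrm{Normal}(0,\ \rho^2), \\
P(y = 1 \mid x) &= P[\gamma a + e > -x(\beta + \gamma\delta) \mid x]
                 = \Phi[x(\beta + \gamma\delta)/\rho].
\end{aligned}
```

So probit of y on x recovers $(\beta + \gamma\delta)/\rho$: the omitted variable shifts the index
coefficients through $\gamma\delta$ in addition to rescaling them.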


15.7.2 Continuous Endogenous Explanatory Variables 


We now explicitly allow for the case where one of the explanatory variables is cor- 
related with the error term in the latent variable model. One possibility is to estimate 
an LPM by 2SLS. This procedure is relatively easy and might provide a good esti- 
mate of the average effect. 

If we want to estimate a probit model with endogenous explanatory variables, we 
must make some fairly strong assumptions. In this section we consider the case of a 
continuous endogenous explanatory variable. 

Write the model as 


$y_1^* = z_1\delta_1 + \alpha_1 y_2 + u_1$ (15.39)
$y_2 = z_1\delta_{21} + z_2\delta_{22} + v_2 = z\delta_2 + v_2$ (15.40)
$y_1 = 1[y_1^* > 0]$, (15.41)


where $(u_1, v_2)$ has a zero mean, bivariate normal distribution, and is independent
of z. Equation (15.39), along with equation (15.41), is the structural equation; equation
(15.40) is a reduced form for $y_2$, which is endogenous if $u_1$ and $v_2$ are correlated.
If $u_1$ and $v_2$ are independent, there is no endogeneity problem. Because $v_2$ is normally
distributed, we are assuming that $y_2$ given z is normal; thus $y_2$ should have
features of a normal random variable. (For example, $y_2$ should not be a discrete
variable.)

The model is applicable when $y_2$ is correlated with $u_1$ because of omitted variables
or measurement error. It can also be applied to the case where $y_2$ is determined jointly
with $y_1$, but with a caveat. If $y_1$ appears on the right-hand side in a linear structural
equation for $y_2$, then the reduced form for $y_2$ cannot be found with $v_2$ having the
stated properties. However, if $y_1^*$ appears in a linear structural equation for $y_2$, then
$y_2$ has the reduced form given by equation (15.40); see Maddala (1983, Chap. 7) for
further discussion.

The normalization that gives the parameters in equation (15.39) an average partial
effect interpretation, at least in the omitted variable and simultaneity contexts, is
$\mathrm{Var}(u_1) = 1$, just as in a probit model with all explanatory variables exogenous. To
see this point, consider the outcome on $y_1$ at two different outcomes of $y_2$, say $y_2^o$
and $y_2^o + 1$. Holding the observed exogenous factors fixed at $z_1^o$, and holding $u_1$ fixed,
the difference in responses is


$1[z_1^o\delta_1 + \alpha_1(y_2^o + 1) + u_1 > 0] - 1[z_1^o\delta_1 + \alpha_1 y_2^o + u_1 > 0]$.


(This difference can take on the values −1, 0, and 1.) Because $u_1$ is unobserved, we
cannot estimate the difference in responses for a given population unit. Nevertheless,
if we average across the distribution of $u_1$, which is Normal(0, 1), we obtain


$\Phi[z_1^o\delta_1 + \alpha_1(y_2^o + 1)] - \Phi(z_1^o\delta_1 + \alpha_1 y_2^o)$.


Therefore, $\delta_1$ and $\alpha_1$ are the parameters appearing in the APE. (Alternatively, if we
begin by allowing $\sigma_1^2 = \mathrm{Var}(u_1) > 0$ to be unrestricted, the APE would depend on
$\delta_1/\sigma_1$ and $\alpha_1/\sigma_1$, and so we should just rescale $u_1$ to have unit variance. The variance
and slope parameters are not separately identified, anyway.) The proper normalization
for $\mathrm{Var}(u_1)$ should be kept in mind, as two-step procedures, which we cover in
the following paragraphs, only consistently estimate $\delta_1$ and $\alpha_1$ up to scale; we have to
do a little more work to obtain estimates of the APE.

The most useful two-step approach is a control function approach due to Rivers
and Vuong (1988), as it leads to a simple test for endogeneity of $y_2$. To derive the
procedure, first note that, under joint normality of $(u_1, v_2)$, with $\mathrm{Var}(u_1) = 1$, we can
write


$u_1 = \theta_1 v_2 + e_1$, (15.42)


where $\theta_1 = \eta_1/\tau_2^2$, $\eta_1 = \mathrm{Cov}(v_2, u_1)$, $\tau_2^2 = \mathrm{Var}(v_2)$, and $e_1$ is independent of z and
$v_2$ (and therefore of $y_2$). Because of joint normality of $(u_1, v_2)$, $e_1$ is also normally
distributed with $E(e_1) = 0$ and $\mathrm{Var}(e_1) = \mathrm{Var}(u_1) - \eta_1^2/\tau_2^2 = 1 - \rho_1^2$, where $\rho_1 =
\mathrm{Corr}(v_2, u_1)$. We can now write

$y_1^* = z_1\delta_1 + \alpha_1 y_2 + \theta_1 v_2 + e_1$, (15.43)
$e_1 \mid z, y_2, v_2 \sim \mathrm{Normal}(0, 1 - \rho_1^2)$. (15.44)

A standard calculation shows that


$P(y_1 = 1 \mid z, y_2, v_2) = \Phi[(z_1\delta_1 + \alpha_1 y_2 + \theta_1 v_2)/(1 - \rho_1^2)^{1/2}]$.


Assuming for the moment that we observe $v_2$, then probit of $y_1$ on $z_1$, $y_2$, and $v_2$ consistently
estimates $\delta_{\rho 1} \equiv \delta_1/(1 - \rho_1^2)^{1/2}$, $\alpha_{\rho 1} \equiv \alpha_1/(1 - \rho_1^2)^{1/2}$, and $\theta_{\rho 1} \equiv \theta_1/(1 - \rho_1^2)^{1/2}$.
Notice that because $\rho_1^2 < 1$, each scaled coefficient is greater in magnitude than its unscaled
counterpart unless $y_2$ is exogenous ($\rho_1 = 0$).

Since we do not know $\delta_2$, we must first estimate it, as in the following procedure:


Procedure 15.1: (a) Run the OLS regression $y_2$ on z and save the residuals $\hat v_2$.
(b) Run the probit $y_1$ on $z_1$, $y_2$, $\hat v_2$ to get consistent estimators of the scaled
coefficients $\delta_{\rho 1}$, $\alpha_{\rho 1}$, and $\theta_{\rho 1}$.
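
The two steps of Procedure 15.1 can be sketched in a few lines of Python. This is a minimal illustration on simulated data, not a reference implementation: the data-generating values, the helper names (`fit_probit`, `simulate`, `rivers_vuong`), and the hand-rolled Newton-Raphson probit are all assumptions made for the example, and the reported standard errors are the usual probit ones, which ignore the first-stage estimation.

```python
import numpy as np
from scipy.stats import norm


def fit_probit(X, y, max_iter=100, tol=1e-10):
    """Probit MLE by Newton-Raphson. Returns (coefficients, usual standard errors)."""
    q = 2.0 * y - 1.0                      # +1 if y = 1, -1 if y = 0
    b = np.zeros(X.shape[1])
    for _ in range(max_iter):
        xb = X @ b
        # Derivative of each log-likelihood term with respect to the index.
        lam = q * np.exp(norm.logpdf(xb) - norm.logcdf(q * xb))
        grad = X.T @ lam
        hess = -(X * (lam * (lam + xb))[:, None]).T @ X
        step = np.linalg.solve(hess, grad)
        b = b - step
        if np.max(np.abs(step)) < tol:
            break
    # Hessian from the last iteration; at convergence it is essentially at the MLE.
    se = np.sqrt(np.diag(np.linalg.inv(-hess)))
    return b, se


def simulate(n=20000, rho1=0.5, tau2=1.0, seed=0):
    """Hypothetical design: one included regressor, one excluded instrument."""
    rng = np.random.default_rng(seed)
    z1 = np.column_stack([np.ones(n), rng.normal(size=n)])
    z = np.column_stack([z1, rng.normal(size=n)])
    v2 = tau2 * rng.normal(size=n)
    # Var(u1) = 1 and Corr(u1, v2) = rho1 by construction.
    u1 = (rho1 / tau2) * v2 + np.sqrt(1.0 - rho1**2) * rng.normal(size=n)
    y2 = z @ np.array([0.0, 0.5, 1.0]) + v2
    y1 = (z1 @ np.array([0.2, 0.5]) - 0.7 * y2 + u1 > 0).astype(float)
    return z1, z, y1, y2


def rivers_vuong(z1, z, y1, y2):
    # Step (a): OLS of y2 on z, keep the residuals.
    delta2_hat, *_ = np.linalg.lstsq(z, y2, rcond=None)
    v2_hat = y2 - z @ delta2_hat
    # Step (b): probit of y1 on z1, y2, and the residuals.
    X = np.column_stack([z1, y2, v2_hat])
    coef, se = fit_probit(X, y1)
    return coef, se, v2_hat


z1, z, y1, y2 = simulate()
coef, se, v2_hat = rivers_vuong(z1, z, y1, y2)
t_exog = coef[-1] / se[-1]                 # t statistic on the residual term
print("scaled coefficients (const, z1, y2, v2_hat):", coef)
print("t statistic on v2_hat:", t_exog)
```

The last coefficient estimates the scaled $\theta_{\rho 1}$, and its t statistic is the exogeneity test; the remaining coefficients are the scaled $\delta_{\rho 1}$ and $\alpha_{\rho 1}$.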


A nice feature of Procedure 15.1 is that the usual probit t statistic on $\hat v_2$ is a valid
test of the null hypothesis that $y_2$ is exogenous, that is, $H_0: \theta_1 = 0$. If $\theta_1 \neq 0$, the
usual probit standard errors and test statistics are not strictly valid, and we have only
estimated $\delta_1$ and $\alpha_1$ up to scale. The asymptotic variance of the two-step estimator
can be derived using the M-estimator results in Section 12.5.2; see also Rivers and
Vuong (1988). Problem 15.15 asks you to obtain the correct variance matrix and to
show that it is always greater than the incorrect one, which ignores estimation of $\delta_2$,
when $\theta_1 \neq 0$. The bootstrap can be used, but some bootstrap samples may have little
variation in $y_1$ if one outcome is much more likely.

Under $H_0: \theta_1 = 0$, $e_1 = u_1$, and so the distribution of $v_2$ plays no role under the
null. Therefore, the test of exogeneity is valid without assuming normality or homoskedasticity
of $v_2$, and it can be applied very broadly, even if $y_2$ is a binary variable.
Unfortunately, if $y_2$ and $u_1$ are correlated, normality of $v_2$ is crucial.


Example 15.3 (Testing Exogeneity of Education in the Women’s LFP Model): We 
test the null hypothesis that educ is exogenous in the married women’s labor force 
participation equation. We first obtain the reduced form residuals, $\hat v_2$, from regressing
educ on all exogenous variables, including motheduc, fatheduc, and huseduc. Then, we
add $\hat v_2$ to the probit from Example 15.2. The t statistic on $\hat v_2$ is only .867, which is weak
evidence against the null hypothesis that educ is exogenous. As always, this conclusion
hinges on the assumption that the instruments for educ are themselves exogenous. 


After the two-step estimation, we can easily obtain estimates of the unscaled
parameters, $\beta_1 = (\delta_1', \alpha_1)'$, which then allows us to estimate partial effects. From the
two-step estimation procedure, we have consistent estimators of $\delta_2$ and $\tau_2^2$ from the
first-stage regression, and then $\delta_{\rho 1}$, $\alpha_{\rho 1}$, and $\theta_{\rho 1}$ from the second stage. Straightforward
algebra shows that $1 + \theta_{\rho 1}^2\tau_2^2 = 1/(1 - \rho_1^2)$. Now, because $\delta_1 = (1 - \rho_1^2)^{1/2}\delta_{\rho 1}$
and $\alpha_1 = (1 - \rho_1^2)^{1/2}\alpha_{\rho 1}$, it follows that $\beta_1 = \beta_{\rho 1}/(1 + \theta_{\rho 1}^2\tau_2^2)^{1/2}$, where $\beta_{\rho 1} =
(\delta_{\rho 1}', \alpha_{\rho 1})'$ is the vector of scaled coefficients. Therefore, we can obtain consistent
estimators of the original coefficients as


$\hat\beta_1 = \hat\beta_{\rho 1}/(1 + \hat\theta_{\rho 1}^2\hat\tau_2^2)^{1/2}$, (15.45)


where all quantities on the right-hand side of equation (15.45) are available from the
two-step estimation procedure.
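
As a quick check on the rescaling in equation (15.45), the following sketch starts from hypothetical population values (chosen only for illustration), forms the scaled coefficients, and confirms that dividing by $(1 + \theta_{\rho 1}^2\tau_2^2)^{1/2}$ returns the original ones. The function name `unscale` is an assumption of the example.

```python
import numpy as np


def unscale(beta_rho, theta_rho, tau2_sq):
    """Equation (15.45): divide scaled coefficients by (1 + theta_rho^2 * tau2^2)^(1/2)."""
    return np.asarray(beta_rho, dtype=float) / np.sqrt(1.0 + theta_rho**2 * tau2_sq)


# Hypothetical population values, used only to check the algebra.
rho1, tau2 = 0.6, 2.0
theta1 = rho1 / tau2                        # theta1 = rho1 / tau2 when Var(u1) = 1
scale = np.sqrt(1.0 - rho1**2)              # (1 - rho1^2)^(1/2)
beta1 = np.array([0.2, 0.5, -0.7])          # (delta1', alpha1)'

beta_rho = beta1 / scale                    # what the second-stage probit estimates
theta_rho = theta1 / scale
recovered = unscale(beta_rho, theta_rho, tau2**2)
print(recovered)
```

In practice the inputs would be the second-stage probit estimates and the first-stage residual variance, so `recovered` would be an estimate rather than an exact match.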

Given $\hat\delta_1$ and $\hat\alpha_1$, we can compute derivatives and differences in $\Phi(z_1\hat\delta_1 + \hat\alpha_1 y_2)$ at
interesting values of $z_1$ and $y_2$. Sometimes it is useful to evaluate the partial effects at
the mean values. At other times the average partial effects are preferable. For continuous
explanatory variables (including $y_2$), the scale factor for multiplying, say, $\hat\alpha_1$
is $N^{-1}\sum_{i=1}^{N}\phi(z_{i1}\hat\delta_1 + \hat\alpha_1 y_{i2})$. We can use the delta method to obtain standard errors
for the APEs (or partial effects at the average), but the required calculations are 
tedious, given the two-step nature of the estimation. The bootstrap is easily applied 
(but computationally expensive): for each bootstrap sample, one computes the two 
steps of the procedure and obtains the scaled and unscaled coefficients, and then 
computes the APEs (or other partial effects of interest). As described in Section 
12.8.2, one computes the standard deviation across the bootstrap replications. 

An alternative method of computing the APEs does not exploit the normality
assumption for $v_2$. We can use the results in Section 2.2.5 by writing $y_1 = 1[z_1\delta_1 +
\alpha_1 y_2 + u_1 > 0]$ and setting $q \equiv u_1$, $x \equiv (z_1, y_2)$, and $w \equiv v_2$ (a scalar in this case).
Because $y_1$ is a deterministic function of $(z_1, y_2, u_1)$, $v_2$ is trivially redundant in
$E(y_1 \mid z_1, y_2, u_1, v_2)$. Further, we have already used that $D(u_1 \mid z_1, y_2, v_2) = D(u_1 \mid v_2)$,
and so assumption (2.33) holds as well. It follows that APEs are obtained by taking
derivatives or differences of


$E_{v_2}[\Phi(z_1\delta_{\rho 1} + \alpha_{\rho 1}y_2 + \theta_{\rho 1}v_2)]$ (15.46)


with respect to elements of $(z_1, y_2)$. Using Lemma 12.1, a consistent estimator of
(15.46) is given by


$N^{-1}\sum_{i=1}^{N}\Phi(z_1\hat\delta_{\rho 1} + \hat\alpha_{\rho 1}y_2 + \hat\theta_{\rho 1}\hat v_{i2})$, (15.47)


where the $\hat v_{i2}$ are the first-stage OLS residuals from regressing $y_{i2}$ on $z_i$, $i = 1, \ldots, N$.
This approach provides a different strategy for estimating APEs: simply compute
partial effects with respect to $z_1$ and $y_2$ after the second-stage estimation, but then
average these across the $\hat v_{i2}$ in the sample. For a continuous explanatory variable, $y_2$
often being the most important, the estimated APE is $\hat\alpha_{\rho 1}[N^{-1}\sum_{i=1}^{N}\phi(z_{i1}\hat\delta_{\rho 1} + \hat\alpha_{\rho 1}y_{i2} +
\hat\theta_{\rho 1}\hat v_{i2})]$, and a standard error for this APE can be obtained via the delta method or
the bootstrap. Again, the bootstrap is much easier to apply: for each bootstrap sample,
one applies the two-step estimation method and computes the APEs for that
sample. The process is repeated to obtain a bootstrap standard error for the APEs.
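
The averaging in (15.47) is easy to code once the second-stage estimates and first-stage residuals are in hand. The sketch below uses made-up "estimates" and simulated normal draws as a stand-in for the residuals, so every numerical value and the function names `asf` and `ape_y2` are assumptions for illustration. With normal $v_2$, averaging out the residuals should agree (up to simulation error) with plugging the unscaled coefficients into $\Phi(\cdot)$.

```python
import numpy as np
from scipy.stats import norm


def asf(z1_0, y2_0, delta_rho, alpha_rho, theta_rho, v2_hat):
    """Equation (15.47): average Phi over the residuals at fixed (z1_0, y2_0)."""
    index = np.dot(z1_0, delta_rho) + alpha_rho * y2_0 + theta_rho * v2_hat
    return norm.cdf(index).mean()


def ape_y2(z1, y2, delta_rho, alpha_rho, theta_rho, v2_hat):
    """Estimated APE of continuous y2: alpha_rho times the averaged normal density."""
    index = z1 @ delta_rho + alpha_rho * y2 + theta_rho * v2_hat
    return alpha_rho * norm.pdf(index).mean()


# Hypothetical inputs standing in for two-step output.
rng = np.random.default_rng(0)
n = 200_000
tau2, theta_rho = 2.0, 0.375
delta_rho, alpha_rho = np.array([0.3]), -0.5
v2_hat = rng.normal(scale=tau2, size=n)      # stand-in for first-stage residuals

# Averaging out v2 at (z1, y2) = (1, 0) ...
asf_hat = asf(np.array([1.0]), 0.0, delta_rho, alpha_rho, theta_rho, v2_hat)
# ... versus using the unscaled coefficient, as in (15.45).
asf_unscaled = norm.cdf(0.3 / np.sqrt(1.0 + theta_rho**2 * tau2**2))

z1 = np.ones((n, 1))
y2 = rng.normal(size=n)
ape_hat = ape_y2(z1, y2, delta_rho, alpha_rho, theta_rho, v2_hat)
print(asf_hat, asf_unscaled, ape_hat)
```

A bootstrap standard error would wrap both estimation steps and this calculation inside the resampling loop.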

The control function (CF) approach has some decided advantages over another
two-step approach, one that appears to mimic the 2SLS estimation of the linear
model. Rather than conditioning on $v_2$ along with z (and therefore $y_2$) to obtain
$P(y_1 = 1 \mid z, v_2) = P(y_1 = 1 \mid z, y_2, v_2)$, we can obtain $P(y_1 = 1 \mid z)$. To find the latter
probability, we plug in the reduced form for $y_2$ to get $y_1 = 1[z_1\delta_1 + \alpha_1(z\delta_2) + \alpha_1 v_2 +
u_1 > 0]$. Because $\alpha_1 v_2 + u_1$ is independent of z and $(u_1, v_2)$ has a bivariate normal
distribution, $P(y_1 = 1 \mid z) = \Phi\{[z_1\delta_1 + \alpha_1(z\delta_2)]/\omega_1\}$, where $\omega_1^2 \equiv \mathrm{Var}(\alpha_1 v_2 + u_1) =
\alpha_1^2\tau_2^2 + 1 + 2\alpha_1\mathrm{Cov}(v_2, u_1)$. A two-step procedure now proceeds by using the same
first-step OLS regression (in this case, to get the fitted values, $\hat y_{i2} = z_i\hat\delta_2$), now followed
by a probit of $y_{i1}$ on $z_{i1}$, $\hat y_{i2}$. It is easily seen that this method estimates the
coefficients up to the common scale factor $1/\omega_1$, which can be any positive value
(unlike in the CF case, where we know the scale factor is greater than unity).

As with the CF approach, getting the appropriate standard errors is difficult. A
primary drawback of the method that inserts $\hat y_2$ for $y_2$ is that it does not provide
a simple test of the null hypothesis that $y_2$ is exogenous. Plus, the coefficients cannot
be directly compared to the usual probit estimates in a Hausman test because of the
different scale factors. Additionally, while the APEs can be recovered in much the
same way as they can be for the CF approach, equation (15.47) is not available for
the fitted values approach. Finally, plugging in fitted values is strictly limited to the
structural equation (15.39); adding other functions of $y_2$ is very cumbersome, and
one is prone to making mistakes. See Problem 15.14 and the discussion at the end of
this subsection.


Example 15.3 (Endogeneity of Nonwife Income in the Women’s LFP Model): We
use the data in MROZ.RAW to test the null hypothesis that nwifeinc is exogenous in
the probit model estimated in Table 15.1. We use as an instrument for nwifeinc husband’s
years of schooling, huseduc. Therefore, the identification assumption (perhaps
a tenuous one) is that husband’s schooling is unrelated to factors that affect a married
woman’s labor force decision once nwifeinc and the other variables (including the
woman’s education) are accounted for. In the first-stage regression of nwifeinc on
huseduc and the other explanatory variables listed in Table 15.1, the fully robust t
statistic on huseduc is about 6.92, which is hardly surprising, because nwifeinc is
pretty highly correlated with husband’s labor earnings, which in turn depends on
husband’s education. When the reduced form residuals, $\hat v_2$, are added to the probit,
its coefficient is about .027 with a t statistic of about 1.41, which is only moderate evidence
that nwifeinc is endogenous. The coefficient on nwifeinc becomes about −.037, which
is certainly larger in magnitude than the probit estimate in Table 15.1. But we must
compare partial effects. If we average the scale factor $\phi(z_{i1}\hat\delta_{\rho 1} + \hat\alpha_{\rho 1}y_{i2} + \hat\theta_{\rho 1}\hat v_{i2})$
across all i (so that we are averaging across the distribution of $(z_1, y_2)$ in addition
to averaging out $v_2$), we obtain about .300. Therefore, the APE for nwifeinc is
.300(−.037) ≈ −.011 with a bootstrap standard error, based on 500 bootstrap samples,
of about .0058. The estimate, which is marginally statistically significant based
on the bootstrap standard error, is about three times larger than the probit estimate
that treats nwifeinc as exogenous, which was −.0036. If we use equation (15.45) to
recover the estimates of the original coefficients, $\hat\alpha_1 = -.0355$, and this agrees with
the joint MLE (as it should because the model is just identified). The scale factor
based on $\phi(z_{i1}\hat\delta_1 + \hat\alpha_1 y_{i2})$ is about .297, which is very close to the scale factor based
on (15.47). For comparison, we estimate a linear probability model by 2SLS. The 2SLS
coefficient on nwifeinc is about −.012 (robust standard error = .0059), which, for
practical purposes, is the same as the APE for the CF estimate in the probit model.

In the end, we are left in a difficult situation: the evidence against exogeneity of 
nwifeinc is not particularly strong, yet the APE that treats nwifeinc as endogenous is 
quite a bit larger than the APE that treats nwifeinc as exogenous. 


In the previous example the model is just identified because we have one instrument,
huseduc, for the endogenous variable nwifeinc. If we have overidentifying
restrictions, these are easily tested using the CF approach. The key restriction is that
$D(u_1 \mid y_2, z) = D(u_1 \mid v_2)$; that is, the conditional distribution depends only on the
linear combination $y_2 - z\delta_2$. If $z_2$ is the $1 \times L_2$ vector of exogenous variables
excluded from the structural equation, then this restriction on $D(u_1 \mid y_2, z)$ imposes
$L_2 - 1$ overidentifying restrictions. We can test these by including any $1 \times (L_2 - 1)$
subvector of $z_2$, say $h_2$, as additional explanatory variables in part (b) of Procedure
15.1, and conduct a joint significance test. In effect, we are testing $E(h_2'e_1) = 0$, where
$u_1 = \theta_1 v_2 + e_1$. (Naturally, if we are allowing $y_2$ to be endogenous under the null, as
we should, then the first-step estimation of $\delta_2$ should be accounted for, either via the
delta method or the bootstrap.) It can be shown that the test is invariant to which
subset of $z_2$ we choose for $h_2$, provided $h_2$ has $L_2 - 1$ elements.

Rather than use a two-step procedure, we can estimate equations (15.39)—(15.41) 
by conditional maximum likelihood estimation (CMLE). To obtain the joint distribution
of $(y_1, y_2)$, conditional on z, recall that

$f(y_1, y_2 \mid z) = f(y_1 \mid y_2, z)\,f(y_2 \mid z)$ (15.48)


(see Property CD.2 in Appendix 13A). Since $y_2 \mid z \sim \mathrm{Normal}(z\delta_2, \tau_2^2)$, the density
$f(y_2 \mid z)$ is easy to write down. We can also derive the conditional density of $y_1$ given
$(y_2, z)$. Since $v_2 = y_2 - z\delta_2$ and $y_1 = 1[y_1^* > 0]$,


$P(y_1 = 1 \mid y_2, z) = \Phi\left[\dfrac{z_1\delta_1 + \alpha_1 y_2 + (\rho_1/\tau_2)(y_2 - z\delta_2)}{(1 - \rho_1^2)^{1/2}}\right]$, (15.49)


where we have used the fact that $\theta_1 = \rho_1/\tau_2$.
Let w denote the term inside $\Phi(\cdot)$ in equation (15.49). Then we have derived


$f(y_1, y_2 \mid z) = \{\Phi(w)\}^{y_1}\{1 - \Phi(w)\}^{1 - y_1}(1/\tau_2)\,\phi[(y_2 - z\delta_2)/\tau_2]$,


and so the log likelihood for observation i (apart from terms not depending on the
parameters) is


$y_{i1}\log\Phi(w_i) + (1 - y_{i1})\log[1 - \Phi(w_i)] - \tfrac{1}{2}\log(\tau_2^2) - \tfrac{1}{2}(y_{i2} - z_i\delta_2)^2/\tau_2^2$, (15.50)


where we understand that $w_i$ depends on the parameters $(\delta_1, \alpha_1, \rho_1, \delta_2, \tau_2)$:


$w_i \equiv [z_{i1}\delta_1 + \alpha_1 y_{i2} + (\rho_1/\tau_2)(y_{i2} - z_i\delta_2)]/(1 - \rho_1^2)^{1/2}$.

Summing expression (15.50) across all i and maximizing with respect to all parameters
gives the MLEs of $\delta_1$, $\alpha_1$, $\rho_1$, $\delta_2$, $\tau_2^2$. The general theory of conditional MLE
applies, and so standard errors can be obtained using the estimated Hessian, the
estimated expected Hessian, or the outer product of the score. MLE applied to this
model sometimes goes by the name of instrumental variables probit, or IV probit.
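
For readers who want to see expression (15.50) as code, here is a minimal sketch of the observation-level log likelihood. The parameter packing, the use of `tanh` and `exp` to keep $\rho_1$ in $(-1, 1)$ and $\tau_2 > 0$, the simulated design, and all function names are assumptions of the example rather than part of the derivation.

```python
import numpy as np
from scipy.stats import norm


def loglik_obs(params, y1, y2, z1, z):
    """Expression (15.50) for each observation, apart from additive constants."""
    k1, k = z1.shape[1], z.shape[1]
    delta1 = params[:k1]
    alpha1 = params[k1]
    delta2 = params[k1 + 1:k1 + 1 + k]
    rho1 = np.tanh(params[-2])               # assumed reparameterization: |rho1| < 1
    tau2 = np.exp(params[-1])                # assumed reparameterization: tau2 > 0
    v2 = y2 - z @ delta2
    w = (z1 @ delta1 + alpha1 * y2 + (rho1 / tau2) * v2) / np.sqrt(1.0 - rho1**2)
    q = 2.0 * y1 - 1.0                       # log Phi(q*w) covers both y1 = 1 and y1 = 0
    return norm.logcdf(q * w) - 0.5 * np.log(tau2**2) - 0.5 * v2**2 / tau2**2


def simulate(n=5000, rho1=0.5, tau2=1.0, seed=0):
    """Hypothetical data from (15.39)-(15.41) with one excluded instrument."""
    rng = np.random.default_rng(seed)
    z1 = np.column_stack([np.ones(n), rng.normal(size=n)])
    z = np.column_stack([z1, rng.normal(size=n)])
    v2 = tau2 * rng.normal(size=n)
    u1 = (rho1 / tau2) * v2 + np.sqrt(1.0 - rho1**2) * rng.normal(size=n)
    y2 = z @ np.array([0.0, 0.5, 1.0]) + v2
    y1 = (z1 @ np.array([0.2, 0.5]) - 0.7 * y2 + u1 > 0).astype(float)
    return y1, y2, z1, z


y1, y2, z1, z = simulate()
true_params = np.array([0.2, 0.5, -0.7, 0.0, 0.5, 1.0, np.arctanh(0.5), np.log(1.0)])
wrong_params = true_params.copy()
wrong_params[2] = 0.7                        # flip the sign of alpha1

ll_true = loglik_obs(true_params, y1, y2, z1, z).sum()
ll_wrong = loglik_obs(wrong_params, y1, y2, z1, z).sum()
print(ll_true, ll_wrong)
```

The negative of the summed log likelihood can be handed to a general-purpose optimizer such as `scipy.optimize.minimize`; the two-step estimates are natural starting values.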

Maximum likelihood estimation has some decided advantages over two-step procedures.
First, MLE is more efficient than any two-step procedure. Second, we get
direct estimates of $\delta_1$ and $\alpha_1$, the parameters of interest for computing partial effects.
Evans, Oates, and Schwab (1992) study peer effects on teenage behavior using the full 
MLE. Of course, obtaining standard errors for the APEs still requires using the delta 
method or the bootstrap. 

Testing that $y_2$ is exogenous is easy once the MLE has been obtained: just test
$H_0: \rho_1 = 0$ using an asymptotic t test. We could also use a likelihood ratio test.

The drawback to the MLE is computational. Sometimes it can be difficult to get
the iterations to converge, as $\hat\rho_1$ sometimes tends toward 1 or −1.

Comparing the Rivers-Vuong approach to the MLE shows that the former is a
limited information procedure. Essentially, Rivers and Vuong focus on $f(y_1 \mid y_2, z)$,
where they replace the unknown $\delta_2$ with the OLS estimator $\hat\delta_2$ (and they ignore the
rescaling problem by taking $e_1$ in equation (15.43) to have unit variance). MLE estimates
the parameters using the information in $f(y_1 \mid y_2, z)$ and $f(y_2 \mid z)$ simultaneously.
For the initial test of whether $y_2$ is exogenous, the Rivers-Vuong approach
has significant computational advantages. If exogeneity is rejected, it might be worth
doing MLE. As shown by Rivers and Vuong (1988), the two-step estimators and
MLEs are identical when the model is just identified.

Another benefit of the MLE approach for this and related problems is that it forces 
discipline on us in coming up with consistent estimation procedures and correct 
standard errors. It is easy to abuse two-step procedures if we are not careful in 
deriving estimating equations. With MLE, although it can be difficult to derive joint 
distributions of the endogenous variables given the exogenous variables, we know 
that, if the underlying distributional assumptions hold, consistent and efficient esti- 
mators are obtained. 

The CF approach has a somewhat subtle robustness property. It is easy to see that
the CF approach is consistent if we assume that $D(u_1 \mid z, v_2) = D(u_1 \mid v_2)$ and that
$D(u_1 \mid v_2)$ is normal with mean linear in $v_2$ and constant variance. Independence between
$(u_1, v_2)$ and z along with bivariate normality of $(u_1, v_2)$ is sufficient but not
necessary. It is certainly possible for $D(u_1 \mid v_2)$ to be normal without $v_2$ having a
normal distribution. In fact, we could even allow the mean $E(u_1 \mid v_2)$ to be a known,
nonlinear function of $v_2$ (say, a quadratic) and then add these functions to the
probit in the second-step estimation.

For notational and interpretational simplicity, our previous analysis has assumed 
that a single endogenous explanatory variable appears additively inside the index 
function. But a moment’s thought shows that the previous analysis goes through 
much more generally. For example, suppose we specify 


$y_1 = 1[g_1(z_1, y_2)\beta_1 + u_1 > 0]$
$y_2 = g_2(z)\delta_2 + v_2$,


where $g_1(\cdot)$ and $g_2(\cdot)$ are known vector functions. As we discussed in Section 6.2, CF
approaches offer considerable flexibility in specifying functional form, provided the
underlying assumptions hold. Because $y_2$ is a function of $(z, v_2)$, if we maintain independence
between $(u_1, v_2)$ and z, then $D(u_1 \mid z, y_2, v_2) = D(u_1 \mid z, v_2) = D(u_1 \mid v_2)$,
and the previous analysis (whether it is the Rivers-Vuong CF approach or MLE)
goes through by replacing $x_1 = (z_1, y_2)$ with $x_1 = g_1(z_1, y_2)$. For example, we can
choose $g_1(z_1, y_2) = (z_1, y_2 z_1)$ to allow full interaction effects (in addition to a main
effect, since the first element of $z_1$ should be unity). Adding a quadratic or other
polynomials in $y_2$ causes no difficulties. Partial effect and APE calculations become
more complicated only insofar as derivatives of the function $g_1(z_1, y_2)$ must be
obtained. So, for example, in a model with $y_1 = 1[z_1\delta_1 + y_2 z_1\alpha_1 + \gamma_1 y_2^2 + u_1 > 0]$,
the partial effect of $y_2$ is $(z_1\alpha_1 + 2\gamma_1 y_2)\phi(z_1\delta_1 + y_2 z_1\alpha_1 + \gamma_1 y_2^2)$. All of the parameters
can be estimated using the two-step CF estimator or the MLE. Because of our
assumptions, testing for exogeneity proceeds exactly as before: we can use a simple t
statistic on $\hat v_{i2}$ in the second-step probit (or directly test for zero correlation between
$u_1$ and $v_2$ in the context of MLE). (Standard software packages that have an “instrumental
variables probit” command are easily tricked into estimating general
models, provided $y_2$ appears on its own as an explanatory variable, such as models
with quadratics and interactions in $y_2$. One simply specifies $y_2$ as the lone endogenous
explanatory variable with other functions of $y_2$ listed as if they were exogenous;
the likelihood function will be properly computed.)
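
The partial effect formula for the model with interactions and a quadratic can be checked numerically. The sketch below codes the response probability and the analytical derivative with respect to $y_2$, then compares the latter with a central finite difference. The parameter values and function names are hypothetical and serve only the check.

```python
import numpy as np
from scipy.stats import norm


def response_prob(z1, y2, delta1, alpha1, gamma1):
    """Phi(z1*delta1 + y2*z1*alpha1 + gamma1*y2^2)."""
    return norm.cdf(z1 @ delta1 + y2 * (z1 @ alpha1) + gamma1 * y2**2)


def partial_effect_y2(z1, y2, delta1, alpha1, gamma1):
    """(z1*alpha1 + 2*gamma1*y2) * phi(index): derivative of the probability in y2."""
    index = z1 @ delta1 + y2 * (z1 @ alpha1) + gamma1 * y2**2
    return (z1 @ alpha1 + 2.0 * gamma1 * y2) * norm.pdf(index)


# Hypothetical values; the first element of z1 is unity, giving y2 a main effect.
z1 = np.array([1.0, 0.4])
delta1 = np.array([0.2, 0.5])
alpha1 = np.array([-0.7, 0.3])
gamma1 = 0.1
y2 = 0.8

analytic = partial_effect_y2(z1, y2, delta1, alpha1, gamma1)
h = 1e-5
numeric = (response_prob(z1, y2 + h, delta1, alpha1, gamma1)
           - response_prob(z1, y2 - h, delta1, alpha1, gamma1)) / (2.0 * h)
print(analytic, numeric)
```

An APE would average `partial_effect_y2` over the sample, using either unscaled coefficients or scaled coefficients together with the first-stage residuals.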

Another nice feature of the CF approach is that we can allow for a strictly monotonic
transformation of the endogenous explanatory variable in the reduced form.
For example, suppose $y_2 = \log(income)$. Then, of course, $income = \exp(y_2)$, and
so any function of income (say, its level, quadratic, or interaction terms) can
appear in the structural equation if these are deemed better functional forms than
log(income) in the probit. This flexibility is handy because sometimes a variable
needs to be transformed before it is reasonable to assume that it has a reduced form
with an additive error that is independent of z. As another example, if $w_2$ is an
endogenous variable that is strictly in the unit interval, we might choose $y_2 =
\log[w_2/(1 - w_2)]$ as the variable that has a reduced form linear in parameters with an
additive error. Then $w_2$ itself, and any function of it, can appear in the probit model
for $y_1$ because $w_2$ is a well-defined function of $y_2$.

That including the reduced form residuals in the CF approach accounts for endogeneity
of $y_2$, even if we have general functions of $y_2$ and $z_1$ in the model, hinges
crucially on independence between $(u_1, v_2)$ and z. Because of the additivity of $v_2$ in
the reduced form, independence between $v_2$ and z pretty much rules out any discreteness
in $y_2$. And, we are assuming at least normality of $D(u_1 \mid v_2)$ (although that
can be relaxed, as we briefly discuss in Section 15.7.5).

If we have multiple endogenous explanatory variables, say a $1 \times G_1$ row vector $y_2$,
we can solve the endogeneity problem in a very similar way. Maximum likelihood
estimation becomes much more difficult computationally, but the CF approach is
straightforward. The key assumption for the CF approach is $y_2 = z\Delta_2 + v_2$, where
$D(u_1 \mid z, v_2) = D(u_1 \mid v_2)$ and this latter distribution is normal with constant variance
and mean linear in $v_2$. Calculation of partial effects requires averaging out $v_2$, either
by computing the unscaled coefficients or using the extension of equation (15.47).
Maximum likelihood estimation is also feasible when $(u_1, v_2)$ is jointly normal and
independent of z.

Finally, one should not minimize the usefulness of estimating a linear model for $y_1$
using standard IV estimation, probably 2SLS. As we discussed in Section 15.6 with
exogenous explanatory variables, the OLS estimates of an LPM can provide good
estimates of the APEs. The same is true when one (or more) explanatory variable is
endogenous. At a minimum, it makes sense to compare 2SLS estimates of an LPM
with the APEs obtained from the probit model with endogenous $y_2$.


15.7.3 Binary Endogenous Explanatory Variable 


We now consider the case where the probit model contains a binary explanatory 
variable that is endogenous. We begin with the simplest model, 


$y_1 = 1[z_1\delta_1 + \alpha_1 y_2 + u_1 > 0]$ (15.51)
$y_2 = 1[z\delta_2 + v_2 > 0]$, (15.52)


where $(u_1, v_2)$ is independent of z and distributed as bivariate normal with mean zero,
each has unit variance, and $\rho_1 = \mathrm{Corr}(u_1, v_2)$. If $\rho_1 \neq 0$, then $u_1$ and $y_2$ are correlated,
and probit estimation of equation (15.51) is inconsistent for $\delta_1$ and $\alpha_1$.

Model (15.51) and (15.52) applies primarily to omitted variables situations. In
particular, we could not obtain the reduced form (15.52) if the structural model has
$y_1$ as a determinant of $y_2$. Measurement error in binary responses does not lead to
this model, either.

As discussed in Section 15.7.2, the normalization $\mathrm{Var}(u_1) = 1$ is the proper one for
computing partial effects. Often, the effect of $y_2$ is of primary interest, especially
when $y_2$ indicates participation in some sort of program, such as job training, and the
binary outcome $y_1$ might denote employment status. The average treatment effect
(for a given value of $z_1$) is $\Phi(z_1\delta_1 + \alpha_1) - \Phi(z_1\delta_1)$. This effect can be computed for
different subgroups or averaged across the distribution of $z_1$.

To derive the likelihood function, we again need the joint distribution of $(y_1, y_2)$
given z, which we obtain from equation (15.48). To obtain $P(y_1 = 1 \mid y_2, z)$, first note
that


$P(y_1 = 1 \mid v_2, z) = \Phi[(z_1\delta_1 + \alpha_1 y_2 + \rho_1 v_2)/(1 - \rho_1^2)^{1/2}]$. (15.53)


Since $y_2 = 1$ if and only if $v_2 > -z\delta_2$, we need a basic fact about truncated normal
distributions: If $v_2$ has a standard normal distribution and is independent of z, then
the density of $v_2$ given $v_2 > -z\delta_2$ is


$\phi(v_2)/P(v_2 > -z\delta_2) = \phi(v_2)/\Phi(z\delta_2)$. (15.54)

Therefore,

$P(y_1 = 1 \mid y_2 = 1, z) = E[P(y_1 = 1 \mid v_2, z) \mid y_2 = 1, z]$

$= E\{\Phi[(z_1\delta_1 + \alpha_1 y_2 + \rho_1 v_2)/(1 - \rho_1^2)^{1/2}] \mid y_2 = 1, z\}$

$= \dfrac{1}{\Phi(z\delta_2)}\displaystyle\int_{-z\delta_2}^{\infty}\Phi[(z_1\delta_1 + \alpha_1 + \rho_1 v_2)/(1 - \rho_1^2)^{1/2}]\,\phi(v_2)\,dv_2$, (15.55)


where $v_2$ in the integral is a dummy argument of integration. Of course, $P(y_1 =
0 \mid y_2 = 1, z)$ is just one minus equation (15.55).
Similarly, $P(y_1 = 1 \mid y_2 = 0, z)$ is


$\dfrac{1}{1 - \Phi(z\delta_2)}\displaystyle\int_{-\infty}^{-z\delta_2}\Phi[(z_1\delta_1 + \alpha_1 y_2 + \rho_1 v_2)/(1 - \rho_1^2)^{1/2}]\,\phi(v_2)\,dv_2$, (15.56)

where the term $\alpha_1 y_2$ is zero because $y_2 = 0$.

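
The integrals in (15.55) and (15.56) are one dimensional and can be evaluated by numerical quadrature. As a sanity check, the sketch below compares them with the equivalent bivariate normal probabilities, using the facts that $P(y_1 = 1, y_2 = 1 \mid z) = \Phi_2(z_1\delta_1 + \alpha_1, z\delta_2; \rho_1)$ and $P(y_1 = 1, y_2 = 0 \mid z) = \Phi(z_1\delta_1) - \Phi_2(z_1\delta_1, z\delta_2; \rho_1)$, where $\Phi_2$ denotes the standard bivariate normal cdf. The index values and function names are hypothetical.

```python
import numpy as np
from scipy.integrate import quad
from scipy.stats import norm, multivariate_normal


def p_y1_given_y2_1(a, c, rho1):
    """Equation (15.55), with a = z1*delta1 + alpha1 and c = z*delta2."""
    s = np.sqrt(1.0 - rho1**2)
    integral, _ = quad(lambda v: norm.cdf((a + rho1 * v) / s) * norm.pdf(v), -c, np.inf)
    return integral / norm.cdf(c)


def p_y1_given_y2_0(a0, c, rho1):
    """Equation (15.56), with a0 = z1*delta1 (since y2 = 0) and c = z*delta2."""
    s = np.sqrt(1.0 - rho1**2)
    integral, _ = quad(lambda v: norm.cdf((a0 + rho1 * v) / s) * norm.pdf(v), -np.inf, -c)
    return integral / (1.0 - norm.cdf(c))


# Hypothetical index values for one observation.
a0, alpha1, c, rho1 = 0.3, -0.5, 0.4, 0.5
cov = [[1.0, rho1], [rho1, 1.0]]

p11_quad = norm.cdf(c) * p_y1_given_y2_1(a0 + alpha1, c, rho1)
p11_bvn = multivariate_normal.cdf([a0 + alpha1, c], mean=[0.0, 0.0], cov=cov)

p10_quad = (1.0 - norm.cdf(c)) * p_y1_given_y2_0(a0, c, rho1)
p10_bvn = norm.cdf(a0) - multivariate_normal.cdf([a0, c], mean=[0.0, 0.0], cov=cov)
print(p11_quad, p11_bvn, p10_quad, p10_bvn)
```

Agreement between the two routes is what allows a bivariate normal cdf routine to stand in for the integrals when the likelihood is coded.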
Combining the four possible outcomes of $(y_1, y_2)$, along with the probit model for
$y_2$, and taking the log gives the log-likelihood function for maximum likelihood
analysis. Several authors, for example Greene (2003, Section 21.6), have noted a
useful computational feature of the model in equations (15.51) and (15.52). To
describe this feature, and to explain the additional flexibility it affords for extending
the basic model, we introduce the bivariate probit model, typically specified for two
binary responses as


$y_1 = 1[x_1\beta_1 + e_1 > 0]$


$y_2 = 1[x_2\beta_2 + e_2 > 0]$,


where $x_1$ is $1 \times K_1$ and $x_2$ is $1 \times K_2$. In the traditional formulation of bivariate probit,
the error term, $e = (e_1, e_2)$, is assumed to be independent of $(x_1, x_2)$ with a bivariate
normal distribution. In particular, $e \mid x \sim \mathrm{Normal}(0, \Omega)$, where x consists of all exogenous
variables and $\Omega$ is the $2 \times 2$ matrix with ones down its diagonal and off-diagonal
element $\rho = \mathrm{Corr}(e_1, e_2)$. These assumptions imply that $y_1$ and $y_2$ each
follow probit models conditional on x. Therefore, $\beta_1$ and $\beta_2$ can be consistently estimated
by estimating separate probit models. Not surprisingly, if $e_1$ and $e_2$ are correlated,
a joint maximum likelihood procedure is more efficient than the separate
probits. With exogenous explanatory variables, increased efficiency in estimating $\beta_1$
and $\beta_2$ is the main reason for a joint estimation procedure. (Incidentally, when $x_1 =
x_2 = x$, there are generally efficiency gains from joint MLE over probit on each
equation. By contrast, as we saw in Chapter 7 for the linear model, OLS on each
equation and feasible GLS are identical.)




A simple way to obtain the log-likelihood function is to construct the joint density
as $f(y_1 \mid y_2, x)f(y_2 \mid x)$, and it is here that a useful feature of the bivariate probit as
it relates to probit with a binary endogenous variable emerges: the form of the conditional
density, $f(y_1 \mid y_2, x)$, is the same even if $x_1$ includes $y_2$. In fact, $x_1$ can be any
function of $(z_1, y_2)$. In other words, in expressions (15.55) and (15.56) we can replace
$z_1\delta_1 + \alpha_1 y_2$ with $x_1\beta_1 = g_1(z_1, y_2)\beta_1$. The reason we can allow this generality is simple:
when we obtain the density of $y_1$ given $(y_2, x)$, we are already conditioning on
$y_2$, so the form of the density is the same whether or not $y_2$ is in $x_1$. We discussed a
similar feature of the log-likelihood function for estimating the model in Section
15.7.2.

The practical implication of our being able to use the bivariate probit log-likelihood
function for endogenous explanatory variables is computational: if an
econometrics package estimates a bivariate probit, then it can be used directly to estimate
the parameters in (15.51) and (15.52). Further, we can use exactly the same
routine to estimate a model with interactions in the structural equation, $y_1 = 1[z_1\delta_1 +
y_2 z_1\alpha_1 + u_1 > 0]$, with no more work than defining the interactions and including
them in the appropriate command (specifying $y_2$ as the only endogenous variable but
including the interactions with $y_2$ as additional explanatory variables).

Evans and Schwab (1995) use model (15.51) and (15.52) (and linear probability
models) to estimate the causal effect of attending a Catholic high school ($y_2$) on the
probability of attending college ($y_1$), allowing the Catholic high school indicator to
be correlated with unobserved factors that also affect college attendance. As an IV
for $y_2$ they use a binary indicator of whether a student is Catholic. Recently, Altonji,
Elder, and Taber (2005) (AET for short) have revisited this question, using affiliation 
with the Catholic Church and geographic proximity to Catholic schools as instru- 
ments. They compare 2SLS estimates of a linear model—which, as always, may 
provide good approximations to the average partial effects—to those from the 
bivariate probit. In addition to finding the instruments to be suspect, AET conclude 
that identification of the parameters can be driven largely by the nonlinearity in the 
bivariate probit model. Unlike the model in Section 15.7.2, where an exclusion re- 
striction in the structural equation is needed for identification, that is not the case in 
(15.51) and (15.52). Typically, one would be suspicious if an exclusion restriction is 
not available. AET’s results suggest that even if such restrictions are available, they
may have little impact on the estimated average treatment effect (the APE of the
Catholic high school dummy).

Because computing the MLE requires nonlinear optimization, it is tempting to use
some seemingly “obvious” two-step procedures. As an example, we might try to
inappropriately mimic 2SLS. Since $E(y_2 \mid z) = \Phi(z\delta_2)$ and $\delta_2$ is consistently estimated
by probit of $y_2$ on z, it is tempting to estimate $\delta_1$ and $\alpha_1$ from the probit of $y_1$
on $z_1$, $\hat\Phi_2$, where $\hat\Phi_2 \equiv \Phi(z\hat\delta_2)$. This approach does not produce consistent parameter
estimators, for the same reasons the forbidden regression discussed in Section 9.5 for
nonlinear simultaneous equations models does not. For this two-step procedure to
work, we would have to have $P(y_1 = 1 \mid z) = \Phi[z_1\delta_1 + \alpha_1\Phi(z\delta_2)]$. But $P(y_1 = 1 \mid z) =
E(y_1 \mid z) = E(1[z_1\delta_1 + \alpha_1 y_2 + u_1 > 0] \mid z)$, and since the indicator function $1[\cdot]$ is nonlinear,
we cannot pass the expected value through.
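
A small numerical example makes the failure concrete. To keep the algebra transparent, the sketch assumes the simplest case, $\rho_1 = 0$, so that $P(y_1 = 1 \mid z)$ is just a mixture of the two probit probabilities with weights $\Phi(z\delta_2)$ and $1 - \Phi(z\delta_2)$. The index values and function names are hypothetical. Even in this case the probability is not $\Phi[z_1\delta_1 + \alpha_1\Phi(z\delta_2)]$.

```python
from scipy.stats import norm


def true_prob(a0, alpha1, c):
    """P(y1 = 1 | z) when rho1 = 0: mix over y2 = 1 and y2 = 0."""
    p2 = norm.cdf(c)                         # P(y2 = 1 | z) = Phi(z*delta2)
    return p2 * norm.cdf(a0 + alpha1) + (1.0 - p2) * norm.cdf(a0)


def plug_in_prob(a0, alpha1, c):
    """What the shortcut implicitly assumes: Phi[z1*delta1 + alpha1*Phi(z*delta2)]."""
    return norm.cdf(a0 + alpha1 * norm.cdf(c))


# Hypothetical values: z1*delta1 = 0, alpha1 = 1, z*delta2 = 0.
p_true = true_prob(0.0, 1.0, 0.0)
p_plug = plug_in_prob(0.0, 1.0, 0.0)
gap = p_true - p_plug
print(p_true, p_plug, gap)
```

The gap of roughly two percentage points at this single point reflects the nonlinearity of $\Phi(\cdot)$; with $\rho_1 \neq 0$ the mixture components themselves change as well.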

Greene (1998), in commenting on Burnett (1997), asserted that the two-step procedure
that uses first-stage probit fitted values in place of $y_2$ in a second-stage probit
is consistent but inefficient. As a way to obtain more efficient estimators, he proposed
using the full MLE. But it is important to understand that the problem with the two-step
procedure is not just a matter of inefficiency and adjusting for the two-step estimation;
the procedure does not produce consistent estimators of the parameters, and
probably not the APEs either (although a study of this issue would be informative).
Burnett (1997) suggested adding the probit residuals, $\hat r_{i2} \equiv y_{i2} - \Phi(z_i\hat\delta_2)$, in an
attempt to mimic Rivers and Vuong’s (1988) method for a continuous $y_2$, although
she applied nonlinear least squares, not MLE, in the second step. Interestingly, the
Burnett approach does provide a valid test of the null hypothesis that $y_2$ is exogenous,
although it makes more sense to use probit in the second stage, too. So, do a
probit of $y_{i1}$ on $z_{i1}$, $y_{i2}$, $\hat r_{i2}$ and obtain the usual asymptotic t statistic on $\hat r_{i2}$. Under
the null hypothesis that $y_2$ is exogenous, $P(y_1 = 1 \mid z, y_2) = P(y_1 = 1 \mid z_1, y_2) =
\Phi(z_1\delta_1 + \alpha_1 y_2)$, and so no additional functions of $(y_{i2}, z_i)$ should appear. Rather than
use the residuals from the linear reduced form for $y_2$ (which is what the Rivers-Vuong
test does), we can use the residuals from a probit. Unfortunately, if $y_2$ is
endogenous, Burnett’s suggestion does not appear to consistently estimate the
parameters or the APEs.

As mentioned in the previous subsection, we can use the Rivers-Vuong approach 
to test for exogeneity of $y_2$. This has the virtue of being simple, and, if the test fails to
reject, we may not need to compute the MLE. A more efficient test is the score test of
$H_0: \rho_1 = 0$, and this does not require estimation of the full MLE. Of course, if one
has computed the MLE, a $t$ test or LR test can be used.


Example 15.4 (Women’s Labor Force Participation and Having More than Two
Children): We use a data set from Angrist and Evans (1998), in LABSUP.RAW, to
study the effects of having more than two children on women’s labor force participation
decisions. The population is married women in the United States who have at
least two children. The endogenous explanatory variable is $y_2$ = morekids, which is
unity if a woman has three or more children (about 49 percent of the sample). The
response variable is $y_1$ = worked; roughly 59 percent of the women report being in
the labor force at the time of the survey. We also include the variables nonmomi
(“non-mom” income), educ (years of schooling), age, age$^2$, and the race indicators
black and hispan; these are all treated as exogenous. As an instrumental variable for
morekids we use samesex, which is a binary variable equal to one if the first two
children are of the same sex. (While the outcome of samesex is legitimately treated as
random, it is not necessarily exogenous to the labor force participation decision:
having two children of the same sex can shift a family’s budget constraint because,
for example, a bedroom is more easily shared and clothes and toys are more easily
handed down.) Table 15.2 contains the estimation results from five different approaches:
OLS estimation of a linear probability model, probit treating morekids as
exogenous, 2SLS estimation of an LPM, bivariate probit allowing morekids to be
endogenous, and bivariate probit that drops samesex from the probit for morekids.
We include the latter primarily to illustrate that the nonlinearity in the model identifies
the parameters. For brevity, we only report the coefficients and average partial
effects for morekids. For the LPMs, the standard errors are robust to arbitrary
heteroskedasticity.

Table 15.2
Estimated Effect of Having Three or More Children on Women’s Labor Force Participation

Dependent Variable: worked

                          (1)            (2)            (3)             (4)                (5)
Model                     LPM            Probit         LPM             Bivariate probit   Bivariate probit
Estimation method         OLS            MLE            2SLS:           MLE:               MLE: no IV
                                                        samesex as IV   samesex as IV
Coefficient on morekids   -.109 (.006)   -.299 (.015)   -.201 (.096)    -.703 (.204)       -.966 (.243)
APE for morekids          -.109 (.006)   -.109 (.006)   -.201 (.096)    -.256 (.072)       -.349 (*)
$\hat{\rho}$              --             --             --              .254 (.131)        .426 (.162)
Number of observations    31,857         31,857         31,857          31,857             31,857

Standard errors for estimated coefficients and APEs are given in parentheses next to coefficients. For the
nonlinear models, the APE standard errors were obtained from 500 bootstrap samplings.

* A bootstrap standard error could not be obtained for column (5) due to computational problems for some
bootstrap samples.

When morekids is assumed exogenous, the LPM and probit models give the same 
estimated average partial effect of having more than two children to three decimal 
places: on average, women with more than two children are about .11 less likely to be 
in the labor force than women with two children. In addition, the standard errors are 
the same to three decimal places (and the standard error for probit was obtained 
from bootstrapping). When samesex is used as an IV in the LPM estimation, the 
effect almost doubles. Column 4 contains the estimates from the bivariate probit
where morekids is treated as endogenous and samesex is excluded from the structural
equation but appears in the probit model for morekids—the standard case where we 
impose an exclusion restriction. Now, the nonlinear probit model gives a larger estimated
APE than the linear model: $-.256$ versus $-.201$. Interestingly, according to
the APE standard error for the bivariate probit, obtained using bootstrapping, the 
APE is more precisely estimated using the nonlinear model. 

If we drop samesex from the probit model for morekids—which, in a linear con- 
text, would lead to a lack of identification—we estimate an even larger effect: the 
APE is $-.349$. The estimated value of $\rho$ increases substantially when we drop
samesex from the model for morekids, suggesting that including samesex in the model 
for morekids helps reduce the correlation between unobservables that affect both 
morekids and worked. The large increase in the magnitude of the APE suggests that 
the nonlinearity in the bivariate probit plays a critical role in determining the APE, a 
point made by Altonji, Elder, and Taber (2005) when studying the effects of attend- 
ing a Catholic high school on outcomes such as attending college. Generally, it is 
dangerous to rely on nonlinearities to identify parameters and partial effects, and one 
should avoid doing so in bivariate probit models. When bootstrapping was applied to 
obtain a standard error for the APE in column (5), for some bootstrap samples the 
computational algorithm would not converge, suggesting that the model without 
an exclusion restriction is ill specified. In this application, the real issue should be 
whether the difference in APEs between the linear model estimated by 2SLS and the 
bivariate probit that uses the same instrument is important. 

We can also implement the inconsistent two-step estimation approach where the
probit fitted values, say $\widehat{morekids}$, are plugged in for morekids in the second-stage
probit. The coefficient on $\widehat{morekids}$ is about $-.843$, which is quite a bit larger in
magnitude than the full MLE estimate when samesex is used as an IV, $-.703$. The
estimated APE for the two-step procedure is about $-.305$, which again is larger in
magnitude than the full MLE estimate, $-.256$. The difference in APEs is less than the
difference in APEs between the linear 2SLS estimate and full MLE. It is intriguing 
to think that, somehow, the inconsistent two-step procedure might generally deliver 
reasonable estimates of the APEs, but there is no evidence on this point. 


15.7.4 Heteroskedasticity and Nonnormality in the Latent Variable Model 


In applying the probit model it is easy to become confused about the problems 
of heteroskedasticity and nonnormality. The confusion stems from a failure to 
distinguish between the underlying latent variable formulation, as in $y^* = x\beta + e$,
and the response probability in equation (15.8). As we have emphasized throughout
this chapter, for most purposes we want to estimate $P(y = 1 \mid x)$. The latent variable
formulation is convenient for certain manipulations, but we are rarely interested in 
E(y* |x). (Data censoring cases, where we are only interested in the parameters of an 
underlying linear model, are treated in Chapter 19.) 

Once we focus on the response probability, we can easily see why comparing non- 
normality and heteroskedasticity in the latent variable model with the same problems 
in a linear (or nonlinear) regression requires considerable care. First consider the 
problem of nonnormality of e in a probit model. If e is independent of x, we can write 
$P(y = 1 \mid x) = G(x\beta) \neq \Phi(x\beta)$, where, generally, $G(z) = 1 - F(-z)$ and $F(\cdot)$ is the
cdf of e. One often hears statements such as ‘“‘nonnormality in a probit model causes 
bias and inconsistency in the parameter estimates.” Technically, this statement is 
correct, but it largely misses the point. In Section 15.6 we noted that when x has a 
multivariate normal distribution, the LPM consistently estimates the APEs for any 
smooth function G(-). Using a linear model when the response probability is non- 
linear is a fairly serious form of misspecification, yet we consistently estimate perhaps 
the most useful quantities: the APEs. The probit model is likely to do a reasonable 
job of approximating the APEs in lots of cases where $G \neq \Phi$. In fact, the situation
that emerges in Example 15.2 is quite common in applications: probit and logit give 
very similar estimated partial effects. That the logit parameter estimates are larger 
than the probit estimates by roughly a factor of 1.6 is a consequence of the different 
implicit scale factors, but the different scalings are easily accounted for by computing 
partial effects. 

Worrying about the choice of G when the response probability is of interest is no 
different from worrying about the functional form for E(y | x) in a regression context. 
For example, if $y > 0$ and we are considering two models for $E(y \mid x)$, a linear model
$E(y \mid x) = x\beta$ and an exponential model $E(y \mid x) = \exp(x\beta)$, we do not couch the
choice between them in the language of “biased” or “inconsistent” parameter estimation.
That OLS estimation of a linear model will not consistently estimate the
parameters in $\exp(x\beta)$ is both obvious and not particularly relevant. The issue is
whether a linear or exponential functional form provides the best fit, and whether the
estimated partial effects on $E(y \mid x)$ are very different across the models. In some cases,
the linear model does provide good estimates of the APEs. We should have the same
discussion when deciding on $G$ in the index model $P(y = 1 \mid x) = E(y \mid x) = G(x\beta)$:
our choice of $G$ is a functional form issue. A nonnormal distribution of $u$ in the linear
regression model $y = x\beta + u$, $E(u \mid x) = 0$, does not change the functional form of
$E(y \mid x)$, and that is why nonnormality is harmless (provided one can rely on
asymptotic theory). Nonnormality in $e$ in $y = 1[x\beta + e > 0]$ means that the probit
response probability is incorrect. Even if we could estimate $\beta$ consistently we could
not obtain magnitudes of the partial effects. So our focus should be on how well different
methods approximate partial effects, and not on whether they estimate
parameters consistently. (In the next subsection, we show that relative partial effects 
of continuous variables can be identified without specifying G(-), and sometimes even 
if we allow e and x to have some dependence.) 

Once we view nonnormality of e as a functional form problem for p(x), we can 
evaluate more general parametric models in a sensible way. It can be a good idea to 
replace $\Phi(x\beta)$ with a function such as $G(x\beta, \gamma)$, where $\gamma$ is an extra set of parameters,
especially if $G(x\beta, \gamma)$ is chosen to nest the probit model. (Moon (1988) covers some
interesting possibilities in the context of logit models, including asymmetric distributions.
See also Problem 15.16.) But generalizing functional form by making the
distribution of $e$ more general is not necessarily better than just specifying more
flexible models for the response probability directly, as in McDonald (1996). And we
definitely should not reject the standard models just because the estimates of $\beta$ seem
to change a lot: the basis for comparison should be partial effects at various values of 
x, APEs, and goodness-of-fit measures such as the values of the log-likelihood func- 
tions and the percent correctly predicted. 

Similar comments can be made about heteroskedasticity in the latent error e. Tra- 
ditionally, discussions of its implications tend to suffer from a failure to specify what 
it is we hope to learn when we estimate binary response models. For example, one 
often sees reference to Yatchew and Griliches (1984) concerning the inconsistency 
of the probit MLE when $\mathrm{Var}(e \mid x)$ depends on $x$, even if $D(e \mid x)$ is normal. If
$D(e \mid x) = \mathrm{Normal}(0, h(x))$, the response probability takes the form
$P(y = 1 \mid x) = \Phi[x\beta / h(x)^{1/2}]$, and so it is pretty obvious that probit of $y$ on $x$ could not consistently
estimate $\beta$. Some authors have noted this finding is particularly troubling because
“heteroskedasticity is prevalent with microeconomic data.” But what does this mean?
Even if the usual probit model is correct, so that $\mathrm{Var}(e \mid x) = 1$, the variance of the
response variable, $y$, is heteroskedastic: $\mathrm{Var}(y \mid x) = \Phi(x\beta)[1 - \Phi(x\beta)]$. In fact, most
observed discrete random variables y will have conditional distributions such that 
Var(y |x) is not constant, but that is due to the discreteness in y, not necessarily be- 
cause of heteroskedasticity in an underlying latent error. 

Is there any reason we should consider possible heteroskedasticity in Var(y* | x) 
when y* is an unknowable latent variable? There are two reasons. The simplest is for 
generalizing the functional form of the response probability—just as when we allow 
for nonnormality in D(y* | x). In Section 15.5.3 we discussed testing the probit model 
against an alternative where $\mathrm{Var}(e \mid x) = \exp(2x_1\delta)$ for $x_1$ a subset of $x$ (which, at a
minimum, excludes a constant). Then $P(y = 1 \mid x) = \Phi[\exp(-x_1\delta)x\beta]$, and so we have
potentially allowed a much more flexible response probability. But there is a cost to
the more general model: the partial effects, $\partial P(y = 1 \mid x)/\partial x_j$, are more complicated.
If $\delta_j$ is the coefficient on $x_j$, an element of the vector $x_1$, then

$$\frac{\partial P(y = 1 \mid x)}{\partial x_j} = \phi[\exp(-x_1\delta)x\beta] \exp(-x_1\delta)[\beta_j - \delta_j \cdot (x\beta)], \tag{15.57}$$

and so the sign of the partial effect at $x$ depends on the sign of $\beta_j - \delta_j \cdot (x\beta)$; it need
not be the same sign as $\beta_j$. (We can actually find models where the partial effect
is always the opposite sign from the coefficient. If $y = 1[\beta_0 + \beta_1 x_1 + e > 0]$ and
$D(e \mid x) = \mathrm{Normal}(0, x_1^2)$, then $P(y = 1 \mid x_1) = \Phi(\beta_0/x_1 + \beta_1)$, and so
$\partial P(y = 1 \mid x_1)/\partial x_1 = -(\beta_0/x_1^2)\phi(\beta_0/x_1 + \beta_1)$. If $\beta_0 > 0$ and $\beta_1 > 0$, the partial effect of $x_1$ on
$P(y = 1 \mid x_1)$ is negative while $\beta_1$, the partial effect of $x_1$ on $E(y^* \mid x_1)$, is positive.)
These days, estimating a so-called heteroskedastic probit model (which usually means 
an exponential conditional variance for e) is fairly straightforward. The challenge is 
in intelligently using the resulting estimates. One possibility is to compute partial 
effects from equation (15.57) and average them across the sample. In any case, it 
makes no sense to compare the estimates of $\beta$ from a heteroskedastic probit with
those from a standard probit. In each case the MLE estimates adjust to fit the data, 
and this often means that different methods can provide similar partial effects, at 
least for some values of x. 
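
As a rough illustration of averaging the partial effects in equation (15.57) across the sample, the following Python sketch evaluates the expression at each observation and takes the sample mean. The function names, the simulated data, and the parameter values are assumptions made for illustration only; in practice `beta` and `delta` would be the heteroskedastic probit MLEs.

```python
import math
import numpy as np


def norm_pdf(t):
    """Standard normal density."""
    return np.exp(-0.5 * np.asarray(t, dtype=float) ** 2) / math.sqrt(2.0 * math.pi)


# Standard normal cdf built from math.erf so that only NumPy is required.
norm_cdf = np.vectorize(lambda t: 0.5 * (1.0 + math.erf(t / math.sqrt(2.0))))


def response_prob(X, beta, delta, x1_cols):
    """P(y = 1 | x) = Phi[exp(-x1 delta) x beta], row by row."""
    scale = np.exp(-(X[:, x1_cols] @ delta))
    return norm_cdf(scale * (X @ beta))


def ape_15_57(X, beta, delta, x1_cols, j):
    """Sample average of (15.57) for the continuous covariate in column j."""
    xb = X @ beta
    scale = np.exp(-(X[:, x1_cols] @ delta))
    # delta_j is zero when x_j does not enter the variance function.
    delta_j = delta[x1_cols.index(j)] if j in x1_cols else 0.0
    pe = norm_pdf(scale * xb) * scale * (beta[j] - delta_j * xb)
    return float(pe.mean())


# Hypothetical data and parameter values, for illustration only.
rng = np.random.default_rng(0)
N = 500
X = np.column_stack([np.ones(N), rng.normal(size=(N, 2))])
beta = np.array([0.2, 0.5, -0.3])
delta = np.array([0.4])
x1_cols = [1]  # only the first non-constant regressor enters Var(e | x)

ape_x1 = ape_15_57(X, beta, delta, x1_cols, j=1)
ape_x2 = ape_15_57(X, beta, delta, x1_cols, j=2)
```

For a covariate that does not appear in $x_1$ (here the last column), $\delta_j = 0$ and the averaged effect necessarily has the sign of $\beta_j$; for a covariate that does appear in $x_1$, the sign can differ from that of its coefficient.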

There is another, more subtle reason to entertain the notion of heteroskedasticity 
in the latent variable model. As discussed in Wooldridge (2005c), the partial effects in 
the model $y_i = 1[x_i\beta + e_i > 0]$, averaged across the distribution of $e_i$, are the same
sign as the corresponding $\beta_j$. In fact, the ratios of APEs for the continuous variables
are equal to the ratios of the coefficients, even when e; contains heteroskedasticity. To 
see why, it is easiest to work with the average structural function (ASF) defined by 
Blundell and Powell (2004); see also Section 2.2.5. For binary response, the ASF as a 
function of x is 


$$\mathrm{ASF}(x) = E_{e_i}\{1[x\beta + e_i > 0]\} = P(e_i > -x\beta) = 1 - F(-x\beta), \tag{15.58}$$


where $F(\cdot)$ is the cdf of $e_i$ (which may not be symmetrically distributed about zero),
and we put an $i$ subscript on $e_i$ to emphasize it is the random variable that is averaged
out. From expression (15.58), we can see that the APE of $x_j$ has the same sign as $\beta_j$,
and, for a continuous $x_j$, the APE is simply $\beta_j f(-x\beta)$, assuming that $F$ is continuously
differentiable with density $f$. The relative APE for two continuous variables
$x_j$ and $x_h$ is $\beta_j/\beta_h$, the same conclusion as in Section 15.7.1 when we introduced
heterogeneity additively inside the probit function that was independent of $x$.

That the sign of the APE for $x_j$ is given by the sign of $\beta_j$, and that the relative
APEs for continuous covariates are given by the ratios of elements of $\beta$, whether or
not $e$ is independent of $x$, suggest that $\beta$ is still of some interest. In the next subsection
we will discuss estimation of $\beta$ under weaker assumptions (up to a scale factor).
But what implications does this discussion about APEs have for heteroskedastic
probit? Interestingly, in the heteroskedastic probit model, we can easily recover the
APEs. The easiest way is through the ASF. Now, rather than using the unconditional
distribution of $e_i$, we use iterated expectations, because what we have modeled is
$D(e_i \mid x_i) = D(e_i \mid x_{i1})$:

$$\mathrm{ASF}(x) = E_{x_{i1}}(E\{1[x\beta + e_i > 0] \mid x_{i1}\}) = E_{x_{i1}}\{\Phi[\exp(-x_{i1}\delta)x\beta]\}, \tag{15.59}$$

where $E\{1[x\beta + e_i > 0] \mid x_{i1}\} = \Phi[\exp(-x_{i1}\delta)x\beta]$ follows from
$1[x\beta + e_i > 0] = 1[\exp(-x_{i1}\delta)e_i > -\exp(-x_{i1}\delta)x\beta]$ along with $D(\exp(-x_{i1}\delta)e_i \mid x_{i1}) = \mathrm{Normal}(0, 1)$.
It is important to see that $x$ is just a fixed argument in (15.59), whereas $x_{i1}$ denotes a
random vector that we average out. For a continuous covariate $x_j$, the partial effect
on the ASF is 


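$$E_{x_{i1}}\{\phi[\exp(-x_{i1}\delta)x\beta]\exp(-x_{i1}\delta)\}\beta_j. \tag{15.60}$$


Given the maximum likelihood estimators $\hat{\beta}$ and $\hat{\delta}$ from heteroskedastic probit, a
consistent estimator of (15.60) is
$\left(N^{-1}\sum_{i=1}^{N} \phi[\exp(-x_{i1}\hat{\delta})x\hat{\beta}]\exp(-x_{i1}\hat{\delta})\right)\hat{\beta}_j$, and then we can
insert interesting values of $x$ or further average out across $x_i$. Bootstrapping would be a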
sensible method for obtaining valid standard errors of the APEs. 
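
The ASF in (15.59) and its partial effects can be sketched numerically along the following lines. The sketch differentiates the sample analogue of (15.59) with respect to $x_j$, holding the $x_{i1}$ that are averaged out fixed, which brings the factor $\exp(-x_{i1}\delta)$ inside the average. The function names, the simulated values of $x_{i1}$, and the parameter values are hypothetical; estimates from a heteroskedastic probit would take the place of `beta` and `delta`.

```python
import math
import numpy as np


def norm_pdf(t):
    """Standard normal density."""
    return np.exp(-0.5 * np.asarray(t, dtype=float) ** 2) / math.sqrt(2.0 * math.pi)


# Standard normal cdf built from math.erf so that only NumPy is required.
norm_cdf = np.vectorize(lambda t: 0.5 * (1.0 + math.erf(t / math.sqrt(2.0))))


def asf(x, X1, beta, delta):
    """Sample analogue of (15.59): average Phi[exp(-x_i1 delta) x beta] over i.

    x is a fixed argument; the rows of X1 (the x_i1) are averaged out.
    """
    scale = np.exp(-(X1 @ delta))
    return float(norm_cdf(scale * (x @ beta)).mean())


def asf_partial_effect(x, X1, beta, delta, j):
    """Derivative of the sample ASF with respect to x_j at the fixed point x."""
    scale = np.exp(-(X1 @ delta))
    return float((norm_pdf(scale * (x @ beta)) * scale).mean() * beta[j])


# Hypothetical values, for illustration only.
rng = np.random.default_rng(0)
X1 = rng.normal(size=(500, 1))       # sample values of x_i1
beta = np.array([0.2, 0.5, -0.3])    # intercept and two slopes
delta = np.array([0.4])
x = np.array([1.0, 0.5, -0.2])       # point at which the ASF is evaluated

pe_1 = asf_partial_effect(x, X1, beta, delta, j=1)
pe_2 = asf_partial_effect(x, X1, beta, delta, j=2)
```

Because the same averaged term multiplies every coefficient, the ratio `pe_1 / pe_2` equals `beta[1] / beta[2]` at any `x`, which is the proportionality property discussed in the text.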

Now we seem to be in a quandary. We have two ways of computing partial effects, 
and they can give conflicting answers, not just on magnitude of the effects but also on 
the direction of the effects. Initially, we discussed how the partial effects based on 
$P(y = 1 \mid x)$, given for continuous $x_j$ in equation (15.57), need not have the same
signs as the $\beta_j$. But equation (15.60) shows that the APEs obtained from averaging
out $e_i$ are, in fact, proportional to the $\beta_j$. (Of course, to obtain the partial effects in
either case we cannot ignore $\delta$, as it appears directly in both formulas.) Which partial
effect is “correct”? Unfortunately, there is a fundamental lack of identification that
does not allow us to choose between the two. Equation (15.60) was obtained from
$y_i = 1[x_i\beta + e_i > 0]$ where $e_i$ given $x_i$ has a heteroskedastic normal distribution. But
suppose we start with the model $y_i = 1[x_i\beta + \exp(x_{i1}\delta)a_i > 0]$, where $a_i$ is independent
of $x_i$ with a standard normal distribution. Now, it is easily shown that the ASF
is just the same as the response probability evaluated at $x$, namely, $\Phi[\exp(-x_1\delta)x\beta]$,
which takes us back to the partial effects in equation (15.57). In fact, the APEs based 
on (15.57) are commonly reported for heteroskedastic probit, whereas we have just as 
good a case for using (15.60).

As discussed by Wooldridge (2005a), on aesthetic grounds we might prefer the
usual index structure with $\mathrm{Var}(e \mid x)$ heteroskedastic, especially because the APEs are
easy to obtain and the $\beta_j$ give directions of effects and relative effects. But both
models nest the usual probit model, and aesthetic considerations cannot change the
fact that two observationally equivalent models lead to possibly quite different APEs.
Plus, there is nothing unappealing about index models of the form
$y_i = 1(x_i\beta + a_i + a_i x_{i1}\delta > 0)$ where $D(a_i \mid x_i) = \mathrm{Normal}(0, 1)$; in fact, interacting the unobservable $a_i$
with a subset of $x_i$ makes perfectly good sense, and we studied linear models with this
feature in Chapters 4 and 6. Because $a_i$ is independent of $x_i$, the partial effects defined
in terms of the ASF are, like those in equation (15.57), the same as the partial effects 
on the response probability p(x). The uncomfortable conclusion is that we have no 
convincing way of choosing between equations (15.57) and (15.60). 


15.7.5 Estimation under Weaker Assumptions 


Probit, logit, and the extensions of these mentioned in the previous subsection are all 
parametric models: P(y = 1|x) depends on a finite number of parameters. There 
have been many advances in estimation of binary response models that relax para- 
metric assumptions on P(y = 1 |x). We briefly discuss some of those here. 

If we are interested in estimating the directions and relative sizes of the partial 
effects, and not the response probabilities, several results are known. In Section 15.6 
we noted a special case—x is multivariate normal—where the slope coefficients in the 
linear projection of $y$ on 1, $x$, say $\lambda$, are the partial effects of $p(x)$ averaged across the
distribution of $x$, regardless of the true response probability! Chung and Goldberger
(1984) obtain conditions under which the linear projection identifies the parameters
up to scale. In fact, in the index formulation $y = 1[y^* > 0]$ with $y^* = \alpha + x\beta + e$,
Chung and Goldberger simply take $\alpha$ and $\beta$ to be the parameters in the linear projection
of $y^*$ on 1, $x$. Their main result is that if $E(x \mid y^*)$ is linear in $y^*$ (sufficient
but not necessary is that $(x, y^*)$ is multivariate normal), then the slope parameters in
the linear projection of $y$ on 1, $x$ are proportional to $\beta$: $\lambda = \tau\beta$, where $\tau$ is the slope
in the linear projection of $y$ on $y^*$. (For binary response, it can be shown that $\tau > 0$.)
In the standard index model $p(x) = G(\alpha + x\beta)$, the partial effects for continuous
covariates are proportional to their betas, and so the linear projection identifies the
relative partial effects of continuous explanatory variables. Unfortunately, we cannot
conclude that the $\lambda$ are themselves the APEs unless $x$ is multivariate normal.

Ruud (1983) obtains similar results for the index models estimated by maximum 
likelihood, but where we misspecify G(-)—thereby employing quasi-MLE. Ruud 
obtains a result that the slopes in the misspecified MLE are consistent for $\tau\beta$ for an
unknown scale factor $\tau$. Ruud’s key condition is linearity of the conditional means
$E(x \mid x\beta)$; multivariate normality of $x$ is sufficient but not necessary. Of course, this
condition is unlikely to be generally satisfied. Ruud (1986) shows how to exploit these 
results to consistently estimate the slope parameters up to scale fairly generally. 

An alternative approach is to explicitly recognize that we do not know the function 
G(-), but the response probability has the index form in equation (15.8). This arises 
from the latent variable formulation (15.9) when e is independent of x but the distri- 
bution of e is not known. There are several semiparametric estimators of the slope 
parameters, up to scale, that do not require knowledge of G. Under certain restric- 
tions on the function G and the distribution of x, the semiparametric estimators are 
consistent and $\sqrt{N}$-asymptotically normal. See, for example, Stoker (1986), Powell,
Stock, and Stoker (1989), Ichimura (1993), Klein and Spady (1993), and Ai (1997). 
Powell (1994) contains a survey of these methods. 

Once $\hat{\beta}$ is obtained, the function $G$ can be consistently estimated (in a sense we
cannot make precise here, as $G$ is part of an infinite dimensional space). Thus, the
response probabilities, as well as the partial effects on these probabilities, can be
consistently estimated for unknown $G$. Obtaining $\hat{G}$ requires nonparametric regression
of $y_i$ on $x_i\hat{\beta}$, where $\hat{\beta}$ are the scaled slope estimators. Accessible treatments of the
methods used are contained in Stoker (1992), Powell (1994), and Härdle and Linton
(1994). 

Remarkably, it is possible to estimate $\beta$ up to scale without assuming that $e$ and $x$
are independent in the model (15.9). In the specification $y = 1[x\beta + e > 0]$, Manski
(1975, 1988) shows how to consistently estimate $\beta$, subject to a scaling, under the
assumption that the median of e given x is zero. Some mild restrictions are needed on 
the distribution of x; the most important of these is that at least one element of x with 
nonzero coefficient is essentially continuous. This allows e to have any distribution, 
and e and x can be dependent; for example, Var(e|x) is unrestricted. Manski’s esti- 
mator, called the maximum score estimator, is a least absolute deviations estimator. 
Since the median of $y$ given $x$ is $1[x\beta > 0]$, the maximum score estimator solves

$$\min_{\beta} \sum_{i=1}^{N} \left| y_i - 1[x_i\beta > 0] \right|$$

over all $\beta$ with, say, $\beta'\beta = 1$, or with some element of $\beta$ fixed at unity if the corresponding
$x_j$ is known to appear in $\mathrm{Med}(y \mid x)$. (A normalization is needed because if
$\mathrm{Med}(y \mid x) = 1[x\beta > 0]$ then $\mathrm{Med}(y \mid x) = 1[x(\tau\beta) > 0]$ for any $\tau > 0$.) The resulting
estimator is consistent (for a recent proof, see Newey and McFadden (1994)), but
its limiting distribution is nonnormal. In fact, it converges to its limiting distribution
at rate $N^{1/3}$. Horowitz (1992) proposes a smoothed version of the maximum score
estimator that converges at a rate close to $\sqrt{N}$.

The maximum score estimator’s strength is that it consistently estimates $\beta$ up to
scale in cases where the index model (15.8) does not hold. As we saw in the previous
subsection, when $y_i = 1[x_i\beta + e_i > 0]$, the APE for $x_j$ has the same sign as $\beta_j$ and the
relative APEs for continuous variables are given by $\beta_j/\beta_h$. Thus, there is some value
in estimating the parameters up to a common scale factor. But maximum score estimation
does not allow estimation of the APEs for either continuous or discrete
covariates because the unconditional distribution of $e_i$ is not identified and, without
making further assumptions, we cannot find $P(y_i = 1 \mid x_i)$. We should not be too
surprised by this state of affairs: if our assumptions are weak, we might not be able to 
learn everything we would like to know about how x affects y. 
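
To make the maximum score objective concrete, here is a minimal Python sketch for the special case of two continuous regressors and no intercept, where the normalization $\beta'\beta = 1$ reduces the problem to a search over a single angle. The grid search, the function names, and the simulated design (a heteroskedastic error with conditional median zero) are illustrative assumptions, not a recommended general-purpose algorithm; the objective is a step function, so derivative-based optimizers do not apply directly.

```python
import numpy as np


def score_loss(b, X, y):
    """Sum over i of |y_i - 1[x_i b > 0]|, the least absolute deviations objective."""
    return float(np.abs(y - (X @ b > 0).astype(float)).sum())


def max_score_2d(X, y, n_grid=720):
    """Grid search over the unit circle (b'b = 1) for a two-regressor model."""
    angles = np.linspace(0.0, 2.0 * np.pi, n_grid, endpoint=False)
    candidates = np.column_stack([np.cos(angles), np.sin(angles)])
    losses = np.array([score_loss(b, X, y) for b in candidates])
    return candidates[losses.argmin()]


# Simulated data (hypothetical design): Med(e | x) = 0, but Var(e | x) depends on x.
rng = np.random.default_rng(2)
N = 4000
X = rng.normal(size=(N, 2))
beta_true = np.array([0.6, 0.8])  # already satisfies b'b = 1
e = 0.5 * (0.5 + np.abs(X[:, 0])) * rng.normal(size=N)
y = (X @ beta_true + e > 0).astype(float)

b_hat = max_score_2d(X, y)
```

Only the direction of `b_hat` is informative: multiplying it by any positive constant leaves the objective unchanged, which is why a normalization is imposed.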

Lewbel (2000) offers a different approach to estimating the coefficients up to scale 
when e is not independent of x but e and x are uncorrelated. The identification results 
and estimation methods can be found in Lewbel’s paper; here we discuss the key 
assumptions and their applicability. Lewbel’s key assumption is the existence of a 
continuous variable, say $x_K$, with $\beta_K \neq 0$, such that $D(e \mid x) = D(e \mid x_1, x_2, \ldots, x_{K-1})$.
In other words, conditional on $(x_1, \ldots, x_{K-1})$, $e$ and $x_K$ are independent. Applied to
the case of heteroskedasticity in the latent variable model, practically speaking the
conditional independence assumption means $E(y^* \mid x)$ must depend on $x_K$ but
$\mathrm{Var}(y^* \mid x)$ must not depend on $x_K$. Therefore, one must be willing to assume that a
variable definitely affects the conditional mean of y* but not its conditional variance. 
If we parametrically model D(e| x), we need not impose such a restriction, and so the 
semiparametric approach does not uniformly improve on parametric approaches. 
(And, as we already know, we can only learn about relative sizes of the coefficients 
using the semiparametric approach, whereas the parametric approach also delivers 
APEs.) 

As pointed out by Lewbel, one case where his assumption holds is when the latent 
variable model is a random coefficient model of the form
$y_i^* = b_{i1}x_{i1} + \cdots + b_{i,K-1}x_{i,K-1} + \beta_K x_{iK} + a_i$, where $b_{i1}, \ldots, b_{i,K-1}$ are random slopes and $\beta_K \neq 0$. If the
vector $(a_i, b_{i1}, \ldots, b_{i,K-1})$ is independent of $x_i$, then Lewbel’s key assumption holds.
The problem is that we have arbitrarily restricted $x_{iK}$ to have a constant, nonzero
coefficient. Even if $x_{iK}$ is generated randomly and independently of the other covariates
and the unobserved heterogeneity, there is no reason to think that its coefficient
is constant while other regressors might have heterogeneous coefficients; it is simply
an arbitrary restriction. (For example, if $y^*$ is a latent variable measuring propensity
to be employed and $x_K$ is a job training indicator, random assignment of $x_K$ has
nothing to do with whether job training has differential effects across individuals on
the propensity to be employed.) If we focus on partial effects, we can allow the entire
vector $b_i = (b_{i1}, \ldots, b_{iK})'$ to vary along with $a_i$. Then, we know that the APEs on
$P(y_i = 1 \mid x_i, a_i, b_i)$ are easily obtained from $P(y_i = 1 \mid x_i)$; see Section 2.2.5. Therefore,
we can directly specify a flexible model for $P(y_i = 1 \mid x_i)$ or derive the response
probability from $P(y_i = 1 \mid x_i, a_i, b_i)$, such as probit, and a distribution for $(a_i, b_i')'$,
such as multivariate normal. 

Some progress has been made in estimating parameters up to scale in the model 
$y_1 = 1[z_1\delta_1 + \alpha_1 y_2 + u_1 > 0]$, where $y_2$ might be correlated with $u_1$ and $z_1$ is a $1 \times L_1$
vector of exogenous variables. Lewbel’s (2000) general approach applies to this situation
as well. Let $z$ be the vector of all exogenous variables uncorrelated with $u_1$.
Then Lewbel requires a continuous element of $z_1$ with nonzero coefficient (say, the
last element, $z_{L_1}$) that does not appear in $D(u_1 \mid y_2, z)$. (Clearly, $y_2$ cannot play the
role of the variable excluded from $D(u_1 \mid y_2, z)$ if $y_2$ is thought to be endogenous.)
When might Lewbel’s exclusion restriction hold? Sufficient is $y_2 = g_2(z_2) + v_2$, where
$(u_1, v_2)$ is independent of $z$ and $z_2$ does not contain $z_{L_1}$. But this means that we have
imposed an exclusion restriction on the reduced form of $y_2$, something usually discouraged
in parametric contexts. (Putting restrictions on reduced forms is much
different, and less defensible, than putting restrictions on structural equations.) Randomization
of $z_{L_1}$ does not make its exclusion from the reduced form of $y_2$ legitimate.
In fact, one often hopes that an instrument for $y_2$ is effectively randomized, which
means that $z_{L_1}$ does not appear in the structural equation but does appear in the
reduced form of $y_2$, the opposite of Lewbel’s assumption. (As a generic example,
eligibility is often randomized but participation, $y_2$, is not, and then one hopes to use
an eligibility dummy as an IV for the participation dummy.) 

As we have discussed at several points, when we are actually interested in partial 
effects on response probabilities, estimating coefficients up to an unknown scale is 
usually unsatisfying. Recently, Blundell and Powell (2004) have shown how to estimate
an average structural function, and therefore the APEs, when $y_2$ is a continuous
endogenous explanatory variable. We can think of their method as a
nonparametric extension of the model in Section 15.7.2. 

Blundell and Powell allow complete flexibility of functional forms, but we can 
illustrate their general approach using an index model $y_1 = 1[x_1\beta_1 + u_1 > 0]$, where
$x_1$ can be any function of $(z_1, y_2)$. The key assumption is that $y_2$ can be written as
$y_2 = g_2(z) + v_2$, where $(u_1, v_2)$ is independent of $z$. The independence of the additive
error $v_2$ and $z$ pretty much rules out discreteness in $y_2$, even though $g_2(\cdot)$ can be left
unspecified. Under the independence assumption,

$$P(y_1 = 1 \mid z, v_2) = E(y_1 \mid z, v_2) = H(x_1\beta_1, v_2)$$

for some (generally unknown) function $H(\cdot, \cdot)$. The average structural function is just
$\mathrm{ASF}(z_1, y_2) = E_{v_{i2}}[H(x_1\beta_1, v_{i2})]$. We can estimate $H$ and $\beta_1$ quite generally by first
estimating the function $g_2(\cdot)$ and then obtaining residuals $\hat{v}_{i2} = y_{i2} - \hat{g}_2(z_i)$. Then, $H$
and $\beta_1$ can be estimated in a second step by a semiparametric procedure, that is, one
that does not assume $H$ is in a parametric family. Then the ASF is estimated by
averaging out the reduced form residuals,

$$\widehat{\mathrm{ASF}}(z_1, y_2) = N^{-1} \sum_{i=1}^{N} \hat{H}(x_1\hat{\beta}_1, \hat{v}_{i2}); \tag{15.61}$$

derivatives and changes can be computed with respect to elements of $(z_1, y_2)$.

Blundell and Powell (2004) actually allow $P(y_1 = 1 \mid z, y_2)$ to have the general
form $H(z_1, y_2, v_2)$, and then the second-step estimation is entirely nonparametric.
They also allow $g_2(\cdot)$ to be fully nonparametric. But parametric approximations
in each stage might produce good estimates of the APEs. For example, $y_{i2}$ can be
regressed on flexible functions of $z_i$ to obtain $\hat{v}_{i2}$. Then, one can estimate probit or
logit models in the second stage that include functions of $z_1$, $y_2$, and $\hat{v}_2$ in a flexible
way (for example, with levels, quadratics, interactions, and maybe even higher-order
polynomials of each). Then, one simply averages out $\hat{v}_{i2}$, as in equation (15.61).
Valid standard errors and test statistics can be obtained by bootstrapping or by using 
the delta method. 
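The averaging step in equation (15.61) can be sketched in a few lines. This is only an illustration under stated assumptions: the second-stage function `fitted_H`, its coefficients `rho1` and `rho2`, and the residuals in `v_hat` are made up for the example and do not come from the text or from any estimation.

```python
import math

def norm_cdf(a):
    """Standard normal CDF via the error function."""
    return 0.5 * (1.0 + math.erf(a / math.sqrt(2.0)))

def asf_hat(index_at_x, v_hat, second_stage):
    """Average the fitted second-stage probability over the first-stage
    residuals, holding the index for the chosen (z1, y2) fixed."""
    return sum(second_stage(index_at_x, v) for v in v_hat) / len(v_hat)

# Hypothetical second-stage fit: a probit in the index, v, and v squared.
def fitted_H(index, v, rho1=0.8, rho2=-0.1):
    return norm_cdf(index + rho1 * v + rho2 * v * v)

v_hat = [-1.2, -0.4, 0.0, 0.3, 0.9, 1.4]      # made-up first-stage residuals
asf_low = asf_hat(0.2, v_hat, fitted_H)        # ASF at one value of x1*beta1
asf_high = asf_hat(0.7, v_hat, fitted_H)       # ASF at another value
effect = asf_high - asf_low                    # estimated change in the ASF
```

The residuals are averaged out while the index is held fixed, which is what distinguishes the ASF from simply evaluating the fitted probability at each observation's own data.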

Obtaining flexible methods that allow for general discrete $y_2$, which would extend the parametric approach of Section 15.7.3, is an important task for future research.


15.8 Binary Response Models for Panel Data 


When analyzing binary responses in the context of panel data, it is often useful to begin with a linear model with an additive, unobserved effect, and then, just as in Chapters 10 and 11, use the within transformation or first differencing to remove the unobserved effect. A linear probability model for binary outcomes has the same problems as in the cross section case. In fact, it is probably less appealing for unobserved effects models, as it implies the unnatural restrictions $-x_{it}\beta \le c_i \le 1 - x_{it}\beta$, $t = 1, \ldots, T$, on the unobserved effects. To see this, note that the response probability for an unobserved effects LPM is $P(y_{it} = 1 \mid x_i, c_i) = P(y_{it} = 1 \mid x_{it}, c_i) = x_{it}\beta + c_i$, which must lie in the unit interval. As in the pure cross section case, the linear functional form is almost certainly false, except in special cases. But FE or FD estimation of the LPM might provide reasonable estimates of APEs. Plus, the LPM has the advantage of not requiring a distributional assumption on $D(c_i \mid x_i)$ (which means we do not need to be concerned that $c_i$ is bounded by linear combinations of $x_{it}$); nor do we have to assume independence of the responses $\{y_{i1}, \ldots, y_{iT}\}$ conditional on $(x_i, c_i)$ (provided we make our inference robust to serial dependence, as well as heteroskedasticity). The nonlinear methods we cover in this section require one or both of these assumptions. First, we briefly cover estimation when the model is not specified with unobserved heterogeneity and has explanatory variables that may not be strictly exogenous.


15.8.1 Pooled Probit and Logit 


In Section 13.8 we used a probit model to illustrate partial likelihood methods with 
panel data. Naturally, we can use logit or any other binary response function as well. 
Suppose the model is 


$$P(y_{it} = 1 \mid x_{it}) = G(x_{it}\beta), \quad t = 1, 2, \ldots, T \tag{15.62}$$

where $G(\cdot)$ is a known function taking on values in the open unit interval. As we discussed in Chapter 13, $x_{it}$ can contain a variety of factors, including time dummies, interactions of time dummies with time-constant or time-varying variables, and lagged dependent variables.

In specifying the model (15.62) we have not assumed nearly enough to obtain the distribution of $y_i \equiv (y_{i1}, \ldots, y_{iT})$ given $x_i \equiv (x_{i1}, \ldots, x_{iT})$, for two reasons. First, we have not assumed $D(y_{it} \mid x_{i1}, \ldots, x_{iT}) = D(y_{it} \mid x_{it})$, so that $\{x_{it} : t = 1, \ldots, T\}$ is not necessarily strictly exogenous. Second, even if we assume strict exogeneity, we have not restricted the dependence in $\{y_{it} : t = 1, \ldots, T\}$ conditional on $x_i$. As with many pooled MLE problems, we have only specified a model for $D(y_{it} \mid x_{it})$. If that model is correctly specified, we can obtain a $\sqrt{N}$-consistent, asymptotically normal estimator by maximizing the partial (pooled) log-likelihood function. For binary response, we maximize the partial log likelihood

$$\sum_{i=1}^{N} \sum_{t=1}^{T} \{ y_{it} \log G(x_{it}\beta) + (1 - y_{it}) \log[1 - G(x_{it}\beta)] \},$$

which is simply an exercise in pooled estimation. Without further assumptions, a robust variance matrix estimator is needed to account for serial correlation in the scores across $t$; see equation (13.53) with $\hat{\beta}$ in place of $\hat{\theta}$ and $G$ in place of $\Phi$. Wald and score statistics can be computed as in Chapter 12.
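As a minimal sketch of the pooled objective, the function below evaluates the partial log likelihood for a panel stored as nested lists. The data layout, the function names, and the toy numbers are assumptions made for illustration; the sketch only evaluates the objective and does not maximize it or compute the robust variance matrix.

```python
import math

def norm_cdf(a):
    return 0.5 * (1.0 + math.erf(a / math.sqrt(2.0)))

def logistic_cdf(a):
    return 1.0 / (1.0 + math.exp(-a))

def pooled_loglik(beta, panel, G=norm_cdf):
    """Partial log likelihood summed over units i and periods t.
    panel[i] is a list of (y_it, x_it) pairs; x_it is a list of
    regressors (include 1.0 for an intercept)."""
    total = 0.0
    for unit in panel:
        for y, x in unit:
            p = G(sum(b * xk for b, xk in zip(beta, x)))
            total += y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return total

# Toy panel with N = 2 units and T = 3 periods (made-up numbers).
panel = [
    [(1, [1.0, 0.5]), (0, [1.0, -0.3]), (1, [1.0, 1.1])],
    [(0, [1.0, -1.0]), (0, [1.0, 0.2]), (1, [1.0, 0.7])],
]
ll_probit = pooled_loglik([0.1, 0.8], panel)
ll_logit = pooled_loglik([0.1, 0.8], panel, G=logistic_cdf)
```

Passing a different `G` switches between probit, logit, or any other response function with values strictly inside the unit interval.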

In the case that the model (15.62) is dynamically complete, that is,

$$P(y_{it} = 1 \mid x_{it}, y_{i,t-1}, x_{i,t-1}, \ldots) = P(y_{it} = 1 \mid x_{it}), \tag{15.63}$$

or

$$D(y_{it} \mid x_{it}, y_{i,t-1}, x_{i,t-1}, \ldots) = D(y_{it} \mid x_{it}),$$

inference is considerably easier: all the usual statistics from a probit or logit that pools observations and treats the sample as a long independent cross section of size $NT$ are valid, including likelihood ratio statistics. Remember, we are definitely not assuming independence across $t$ (for example, $x_{it}$ can contain lagged dependent variables). Dynamic completeness implies that the scores are serially uncorrelated across $t$, which is the key condition for the standard inference procedures to be valid. (See the general treatment in Section 13.8.)

To test for dynamic completeness, we can always add a lagged dependent variable and possibly lagged explanatory variables. As an alternative, we can derive a simple one-degree-of-freedom test that works regardless of what is in $x_{it}$. For concreteness, we focus on the probit case; other index models are handled similarly. Define $u_{it} \equiv y_{it} - \Phi(x_{it}\beta)$, so that, under assumption (15.63), $E(u_{it} \mid x_{it}, y_{i,t-1}, x_{i,t-1}, \ldots) = 0$, all $t$. It follows that $u_{it}$ is uncorrelated with any function of the variables $(x_{it}, y_{i,t-1}, x_{i,t-1}, \ldots)$, including $u_{i,t-1}$. By studying equation (13.53), we can see that it is serial correlation in the $u_{it}$ that makes the usual inference procedures invalid. Let $\hat{u}_{it} \equiv y_{it} - \Phi(x_{it}\hat{\beta})$. Then a simple test is available by using pooled probit to estimate the artificial model

$$\text{``}P(y_{it} = 1 \mid x_{it}, \hat{u}_{i,t-1}) = \Phi(x_{it}\beta + \gamma_1 \hat{u}_{i,t-1})\text{''} \tag{15.64}$$

using time periods $t = 2, \ldots, T$. The null hypothesis is $H_0 : \gamma_1 = 0$. If $H_0$ is rejected, then so is assumption (15.63). This is a case where under the null hypothesis, the estimation of $\beta$ required to obtain $\hat{u}_{i,t-1}$ does not affect the limiting distribution of any of the usual test statistics, Wald, LR, or LM, of $H_0 : \gamma_1 = 0$. The Wald statistic, that is, the $t$ statistic on $\hat{\gamma}_1$, is the easiest to obtain. For the LM and LR statistics we must be sure to drop the first time period in estimating the restricted model ($\gamma_1 = 0$).
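The data step behind the artificial model (15.64) can be sketched as follows. The function name, the nested-list layout, and the placeholder value of `beta_hat` are assumptions for illustration; in practice `beta_hat` would come from a first pooled probit, and the augmented rows would be passed to a second pooled probit to obtain the $t$ statistic on the added regressor.

```python
import math

def norm_cdf(a):
    return 0.5 * (1.0 + math.erf(a / math.sqrt(2.0)))

def add_lagged_residual(panel, beta_hat):
    """Compute u_hat_it = y_it - Phi(x_it beta_hat) for every period and
    return rows (y_it, x_it + [u_hat_{i,t-1}]) for t = 2, ..., T only."""
    rows = []
    for unit in panel:
        u_hat = [y - norm_cdf(sum(b * xk for b, xk in zip(beta_hat, x)))
                 for y, x in unit]
        for t in range(1, len(unit)):       # the first period is dropped
            y, x = unit[t]
            rows.append((y, list(x) + [u_hat[t - 1]]))
    return rows

# Toy panel: panel[i] is a list of (y_it, x_it) pairs (made-up numbers).
panel = [
    [(1, [1.0, 0.5]), (0, [1.0, -0.3]), (1, [1.0, 1.1])],
    [(0, [1.0, -1.0]), (0, [1.0, 0.2]), (1, [1.0, 0.7])],
]
beta_hat = [0.0, 0.0]                        # placeholder first-stage estimate
augmented = add_lagged_residual(panel, beta_hat)
```

Each unit contributes $T - 1$ rows, which matches the requirement to use time periods $t = 2, \ldots, T$.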


15.8.2 Unobserved Effects Probit Models under Strict Exogeneity 


A popular model for binary outcomes with panel data is the unobserved effects probit 
model. A key assumption for many of the estimators for unobserved effects probit 
models, as well as other nonlinear panel data models, is that the observed covariates 
are strictly exogenous conditional on the unobserved effect. Now, we need this 
assumption in terms of conditional distributions, which we can state generically as 


$$D(y_{it} \mid x_i, c_i) = D(y_{it} \mid x_{i1}, x_{i2}, \ldots, x_{iT}, c_i) = D(y_{it} \mid x_{it}, c_i), \quad t = 1, \ldots, T, \tag{15.65}$$

where $c_i$ is the unobserved effect and $x_i \equiv (x_{i1}, x_{i2}, \ldots, x_{iT})$. In stating this assumption, we are not restricting in any way the joint distribution conditional on $(x_i, c_i)$, which we write as $D(y_i \mid x_i, c_i)$ for $y_i = (y_{i1}, \ldots, y_{iT})$. As in previous chapters, where we stated the strict exogeneity assumption conditional on $c_i$ in terms of the conditional expectation, assumption (15.65) rules out models with lagged dependent variables in $x_{it}$ and also other situations where one or more elements of $x_{it}$ may react in the future to idiosyncratic changes in $y_{it}$. Assumption (15.65) also requires that $x_{it}$ includes enough lags of underlying explanatory variables if distributed lag dynamics are present.

For the unobserved effects probit model, we specify a response probability that fully determines the conditional distribution $D(y_{it} \mid x_{it}, c_i)$:

$$P(y_{it} = 1 \mid x_{it}, c_i) = \Phi(x_{it}\beta + c_i), \quad t = 1, \ldots, T. \tag{15.66}$$

Most analyses are not convincing unless $x_{it}$ contains a full set of time dummies. In what follows, we usually leave time dummies implicit to simplify the notation.

Our interest is in the response probability (15.66), and we would like to be able to estimate the parameters and partial effects just by specifying equation (15.66). But we already saw in Chapters 10 and 11 that just specifying a model with contemporaneous conditioning variables is not sufficient for estimating the parameters, and the same is true here: without more assumptions, $\beta$ is not identified. Even adding the strict exogeneity assumption (15.65) is not enough. Unlike with linear models, we face a further difficulty, which has implications both for estimating $\beta$ and for estimating partial effects of interest. Namely, we must specify how $c_i$ relates to the covariates.

What happens if we try to proceed without placing restrictions on $D(c_i \mid x_i)$? One possibility is to add, in addition to assumptions (15.65) and (15.66), the assumption that the responses are independent conditional on $(x_i, c_i)$:

$$y_{i1}, \ldots, y_{iT} \text{ are independent conditional on } (x_i, c_i). \tag{15.67}$$

Because of the presence of $c_i$, the $y_{it}$ are dependent across $t$ conditional only on the observables, $x_i$. (Assumption (15.67) is analogous to the linear model assumption that, conditional on $(x_i, c_i)$, the $y_{it}$ are serially uncorrelated; see Assumption FE.3 in Chapter 10.) Under assumptions (15.65), (15.66), and (15.67), we can derive the density of $(y_{i1}, \ldots, y_{iT})$ conditional on $(x_i, c_i)$:

$$f(y_1, \ldots, y_T \mid x_i, c_i; \beta) = \prod_{t=1}^{T} f(y_t \mid x_{it}, c_i; \beta), \tag{15.68}$$

where $f(y_t \mid x_t, c; \beta) = \Phi(x_t\beta + c)^{y_t}[1 - \Phi(x_t\beta + c)]^{1 - y_t}$. Ideally, we could estimate the quantities of interest without restricting the relationship between $c_i$ and the $x_{it}$. In this spirit, we might view the $c_i$ as parameters to be estimated along with $\beta$, as this treatment obviates the need to make assumptions about the distribution of $c_i$ given $x_i$. The log-likelihood function is $\sum_{i=1}^{N} \ell_i(c_i, \beta)$, where $\ell_i(c_i, \beta)$ is the log of equation (15.68) evaluated at the $y_{it}$. Unfortunately, in addition to being computationally difficult, estimation of the $c_i$ along with $\beta$ introduces an incidental parameters problem. Unlike in the linear case, where estimating the $c_i$ along with $\beta$ leads to the $\sqrt{N}$-consistent FE estimator of $\beta$, in the present case estimating the $c_i$ ($N$ of them) along with $\beta$ leads to inconsistent estimation of $\beta$ with $T$ fixed and $N \to \infty$. We discuss the incidental parameters problem in more detail for the unobserved effects logit model in Section 15.8.3.

The estimator of $\beta$ obtained by treating the $c_i$ as parameters has been called the “fixed effects probit” estimator, an unfortunate name. As we saw with linear models, and as we will see with the logit model in the next subsection and with count data models in Chapter 18, in some cases we can consistently estimate the parameters $\beta$ without specifying a distribution for $c_i$ given $x_i$. This feature is the hallmark of an FE analysis for most microeconometric applications. By contrast, treating the $c_i$ as parameters to estimate can lead to potentially serious biases, as in the probit case.

Here we follow the same approach adopted for linear models: we always treat $c_i$ as an unobservable random variable drawn along with $(x_i, y_i)$. The question is, under what additional assumptions can we consistently estimate parameters, as well as interesting partial effects? Unfortunately, for the unobserved effects probit model, we must make an assumption about the relationship between $c_i$ and $x_i$. The traditional random effects probit model adds, to assumptions (15.65), (15.66), and (15.67), the assumption

$$c_i \mid x_i \sim \mathrm{Normal}(0, \sigma_c^2). \tag{15.69}$$

This is a strong assumption, as it implies that $c_i$ and $x_i$ are independent and that $c_i$ has a normal distribution. It is not enough to assume that $c_i$ and $x_i$ are uncorrelated, or even that $E(c_i \mid x_i) = 0$. The assumption $E(c_i) = 0$ is without loss of generality provided $x_{it}$ contains an intercept, as it always should.

Before we discuss estimation of the random effects probit model, we should be sure we know what we want to estimate. As in Section 15.7.1, consistent estimation of $\beta$ means that we can consistently estimate the partial effects of the elements of $x_t$ on the response probability $P(y_t = 1 \mid x_t, c)$ at the average value of $c$ in the population, $c = 0$. (We can also estimate the relative effects of any two elements of $x_t$ for any value of $c$, as the relative effects do not depend on $c$.) For the reasons discussed in Section 15.7.1, APEs are at least as useful. Since $c_i \sim \mathrm{Normal}(0, \sigma_c^2)$, the APE for a continuous $x_{tj}$ is $[\beta_j/(1 + \sigma_c^2)^{1/2}]\phi[x_t\beta/(1 + \sigma_c^2)^{1/2}]$, just as in equation (15.38). Therefore, we only need to estimate $\beta_c \equiv \beta/(1 + \sigma_c^2)^{1/2}$ to estimate the APEs, for either continuous or discrete explanatory variables. (In other branches of applied statistics, such as biostatistics and education, the coefficients indexing the APEs, $\beta_c$ in our notation, are called the population-averaged parameters.)
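The scaling result for the APE can be checked numerically. The sketch below compares the closed form $\beta_{cj}\,\phi(x_t\beta_c)$ with a direct average of $\beta_j\,\phi(x_t\beta + c)$ over $c \sim \mathrm{Normal}(0, \sigma_c^2)$; the parameter values are arbitrary, and the trapezoid rule is a convenience chosen for this illustration.

```python
import math

def norm_pdf(a):
    return math.exp(-0.5 * a * a) / math.sqrt(2.0 * math.pi)

def ape_closed_form(beta_j, xb, sigma_c):
    """beta_cj * phi(x beta_c), with beta_c = beta / sqrt(1 + sigma_c^2)."""
    scale = 1.0 / math.sqrt(1.0 + sigma_c ** 2)
    return beta_j * scale * norm_pdf(xb * scale)

def ape_by_integration(beta_j, xb, sigma_c, n=4000, width=10.0):
    """Average beta_j * phi(x beta + c) over c ~ Normal(0, sigma_c^2)
    using a trapezoid rule on [-width * sigma_c, width * sigma_c]."""
    lo, hi = -width * sigma_c, width * sigma_c
    h = (hi - lo) / n
    total = 0.0
    for k in range(n + 1):
        c = lo + k * h
        w = 0.5 if k in (0, n) else 1.0      # trapezoid end-point weights
        total += w * beta_j * norm_pdf(xb + c) * norm_pdf(c / sigma_c) / sigma_c
    return total * h

# Arbitrary illustrative values: beta_j = 0.6, x beta = 0.4, sigma_c = 1.5.
closed = ape_closed_form(0.6, 0.4, 1.5)
numeric = ape_by_integration(0.6, 0.4, 1.5)
pe_at_zero = 0.6 * norm_pdf(0.4)             # partial effect at c = 0
```

The two APE calculations agree to numerical precision, while the partial effect at $c = 0$ is a different quantity and is generally not equal to the APE.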

Under assumptions (15.65), (15.66), (15.67), and (15.69), a conditional maximum likelihood approach is available for estimating $\beta$ and $\sigma_c^2$. This is a special case of the approach in Section 13.9. Because the $c_i$ are not observed, they cannot appear in the likelihood function. Instead, we find the joint distribution of $(y_{i1}, \ldots, y_{iT})$ conditional on $x_i$, a step that requires us to integrate out $c_i$. Since $c_i$ has a $\mathrm{Normal}(0, \sigma_c^2)$ distribution,

$$f(y_1, \ldots, y_T \mid x_i; \theta) = \int_{-\infty}^{\infty} \left[ \prod_{t=1}^{T} f(y_t \mid x_{it}, c; \beta) \right] (1/\sigma_c)\phi(c/\sigma_c)\, dc, \tag{15.70}$$

where $f(y_t \mid x_t, c; \beta) = \Phi(x_t\beta + c)^{y_t}[1 - \Phi(x_t\beta + c)]^{1 - y_t}$ and $\theta$ contains $\beta$ and $\sigma_c^2$. Plugging in $y_{it}$ for all $t$ and taking the log of equation (15.70) gives the conditional log likelihood $\ell_i(\theta)$ for each $i$. The log-likelihood function for the entire sample of size $N$ can be maximized with respect to $\beta$ and $\sigma_c^2$ (or $\beta$ and $\sigma_c$) to obtain $\sqrt{N}$-consistent asymptotically normal estimators; Butler and Moffitt (1982) describe a procedure for approximating the integral in equation (15.70). The conditional MLE in this context is typically called the random effects probit estimator, and the theory in Section 13.9 can be applied directly to obtain asymptotic standard errors and test statistics. Since $\beta$ and $\sigma_c^2$ can be estimated, the partial effects at $c = 0$ as well as the APEs can be estimated. Since the variance of the idiosyncratic error in the latent variable model is unity, the relative importance of the unobserved effect is measured as $\rho = \sigma_c^2/(\sigma_c^2 + 1)$, which is also the correlation between the composite latent error, say, $c_i + e_{it}$, across any two time periods. Many random effects probit routines report $\hat{\rho}$ and its standard error; these statistics lead to an easy test for the presence of the unobserved effect.
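To make the integral in equation (15.70) concrete, here is a sketch of one unit's likelihood contribution. It uses a plain trapezoid rule rather than the Butler and Moffitt (1982) procedure, which is a substitution made to keep the example free of external libraries; the function names and the numerical inputs are hypothetical.

```python
import math

def norm_cdf(a):
    return 0.5 * (1.0 + math.erf(a / math.sqrt(2.0)))

def norm_pdf(a):
    return math.exp(-0.5 * a * a) / math.sqrt(2.0 * math.pi)

def re_probit_lik(y, xb, sigma_c, n=2000, width=8.0):
    """One unit's contribution to (15.70): integrate the product over t of
    Phi(xb_t + c)^y_t [1 - Phi(xb_t + c)]^(1 - y_t) against the
    Normal(0, sigma_c^2) density of c, using a trapezoid rule."""
    lo, hi = -width * sigma_c, width * sigma_c
    h = (hi - lo) / n
    total = 0.0
    for k in range(n + 1):
        c = lo + k * h
        prod = 1.0
        for y_t, xb_t in zip(y, xb):
            p = norm_cdf(xb_t + c)
            prod *= p if y_t == 1 else 1.0 - p
        w = 0.5 if k in (0, n) else 1.0
        total += w * prod * norm_pdf(c / sigma_c) / sigma_c
    return total * h

def rho(sigma_c):
    """Share of the composite latent error variance due to c_i."""
    return sigma_c ** 2 / (sigma_c ** 2 + 1.0)

# Hypothetical unit with T = 3: outcomes and index values x_it * beta.
lik_i = re_probit_lik([1, 0, 1], [0.3, -0.2, 0.5], sigma_c=1.0)
loglik_i = math.log(lik_i)
```

With $T = 1$ the integral collapses to $\Phi(x\beta/(1 + \sigma_c^2)^{1/2})$, which gives a simple check on the numerical integration.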

Assumptions (15.67) and (15.69) are fairly strong, and it is possible to relax them. First consider relaxing assumption (15.67). One useful observation is that, under assumptions (15.66) and (15.69) only (even without assumption (15.65)),

$$P(y_{it} = 1 \mid x_{it}) = \Phi(x_{it}\beta_c), \tag{15.71}$$

where $\beta_c \equiv \beta/(1 + \sigma_c^2)^{1/2}$. Therefore, just as in Section 15.8.1, we can estimate $\beta_c$ from pooled probit of $y_{it}$ on $x_{it}$, $t = 1, \ldots, T$, $i = 1, \ldots, N$, meaning that we directly estimate the APEs. If $c_i$ is truly present, $\{y_{it} : t = 1, \ldots, T\}$ will not be independent conditional on $x_i$. Robust inference is needed to account for the serial dependence, as discussed in Section 15.8.1.

Rather than simply calculating robust standard errors for $\hat{\beta}_c$ after pooled probit estimation, or using the full random effects (RE) assumptions and obtaining the MLE, we can apply the generalized estimating equation (GEE) approach that we introduced in Section 12.9. As we discussed there, GEE is simply multivariate weighted nonlinear least squares with a correctly specified model for $E(y_i \mid x_i)$ but a generally misspecified model for the conditional variance matrix $\mathrm{Var}(y_i \mid x_i)$, where $y_i$ is the $T$-vector of responses for unit $i$. In the present application, we have correctly specified each element of $E(y_i \mid x_i)$ as $E(y_{it} \mid x_i) = \Phi(x_{it}\beta_c)$. Plus, each conditional variance, $\mathrm{Var}(y_{it} \mid x_i)$, is necessarily equal to $\Phi(x_{it}\beta_c)[1 - \Phi(x_{it}\beta_c)]$. What is difficult to obtain, even under the full set of RE probit assumptions, are the conditional correlations, $\mathrm{Corr}(y_{it}, y_{is} \mid x_i)$. (In fact, there are no closed-form expressions for these correlations.) The most straightforward GEE method is to specify a constant and exchangeable “working” correlation matrix, which means in using WMNLS we act as if $\mathrm{Corr}(y_{it}, y_{is} \mid x_i) = \rho$ for $|\rho| < 1$. This assumption is almost certainly false; the actual conditional correlations likely depend on $x_i$ in a complicated fashion. Nevertheless, as we explained in Section 12.9, the hope is that allowing a nonzero correlation in the working variance-covariance matrix will produce an estimator asymptotically more efficient than just using pooled probit. At the same time, such a WMNLS estimator will be consistent under assumptions (15.65), (15.66), and (15.69), the same assumptions we use for consistency of pooled probit. Practically, we can think of GEE for RE probit models as a method that has the same robustness as pooled probit yet, ideally, gets back some of the efficiency lost by ignoring the serial dependence in estimation.
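The exchangeable working variance matrix described above can be written down directly for one unit. The sketch below only builds the matrix; it is not a GEE estimator, and the function name, the index values, and the working correlation of 0.4 are assumptions chosen for illustration.

```python
import math

def norm_cdf(a):
    return 0.5 * (1.0 + math.erf(a / math.sqrt(2.0)))

def working_variance(xb_c, rho):
    """Exchangeable working variance matrix for one unit:
    element (t, s) is sd_t * sd_s * rho for t != s and sd_t^2 for t == s,
    where sd_t^2 = p_t (1 - p_t) and p_t = Phi(x_it beta_c)."""
    p = [norm_cdf(a) for a in xb_c]
    sd = [math.sqrt(pt * (1.0 - pt)) for pt in p]
    T = len(p)
    return [[sd[t] * sd[s] * (1.0 if t == s else rho) for s in range(T)]
            for t in range(T)]

# Hypothetical index values x_it * beta_c for T = 3 and rho = 0.4.
V = working_variance([0.0, 0.5, -0.3], rho=0.4)
```

The diagonal is pinned down by the Bernoulli variance, so only the single parameter $\rho$ is "working" in the sense that it may be misspecified without affecting consistency.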

The terminology used in the GEE literature when applied to nonlinear unobserved effects models, such as RE probit, can be elusive. We can minimize our confusion by keeping straight the distinctions among a model, the quantities of interest, and an estimation method. The response probability conditional only on $x_{it}$, $P(y_{it} = 1 \mid x_{it}) = \Phi(x_{it}\beta_c)$ (for which we only need assumptions (15.66) and (15.69) to derive), is called the population-averaged (PA) model in the GEE literature because the parameters, $\beta_c$, index the population-averaged effects. Therefore, using the GEE terminology, the WMNLS estimator is designed to estimate the parameters in the PA model, just as is pooled probit. The model under the full set of RE probit assumptions is called the subject-specific (SS) model, primarily because the response probability in equation (15.66), which is conditional on $x_{it}$ and $c_i$, allows us, in principle, to compute partial effects for different individuals as described by the heterogeneity, $c_i$. In actuality, because $c_i$ is not observed (and cannot be estimated with small $T$), all we can really do is estimate the partial effects on $P(y_{it} = 1 \mid x_{it} = x_t, c_i = c)$ for different values of $c$. These values include the mean (median) value, $c = 0$, as well as percentiles in the $\mathrm{Normal}(0, \sigma_c^2)$ distribution (because we can estimate $\sigma_c^2$ along with $\beta$). But the implication that we can estimate partial effects for specific individuals under the full RE probit assumptions is somewhat misleading.

Rather than act as if there are different models, in the context of unobserved effects probit models it seems better to organize the discussion around quantities that can be estimated under different assumptions on a common model. Much of the time it makes sense to start with equation (15.66) as the model of interest. Then we specify partial effects at different values of $c$ or APEs as the quantities of interest. Finally, we recognize that under (15.66) and (15.69) only, we can estimate the APEs (or PA effects) by pooled probit. If we add the strict exogeneity assumption (15.65), then we can use GEE. If we use the full set of RE probit assumptions and apply MLE, then we can separately estimate $\beta$ and $\sigma_c^2$, and therefore the APEs and partial effects at interesting values of $c$.

A different way to relax assumption (15.67) is to assume a particular correlation structure and then use full CMLE. For example, for each $t$ write the latent variable model as

$$y_{it}^* = x_{it}\beta + c_i + e_{it}, \qquad y_{it} = 1[y_{it}^* > 0] \tag{15.72}$$

and assume that the $T \times 1$ vector $e_i$ is multivariate normal, with unit variances, but unrestricted correlation matrix. This assumption, along with assumptions (15.65), (15.66), and (15.69), fully characterizes the distribution of $y_i$ given $x_i$. However, even for moderate $T$, computation of the CMLE can be very difficult. Recent advances in simulation methods of estimation make it possible to estimate such models for fairly large $T$; see, for example, Keane (1993) and Geweke and Keane (2001). The pooled probit and GEE procedures that we have described are valid for estimating $\beta_c$, the vector that indexes APEs, regardless of the serial dependence in $\{e_{it}\}$, but they are inefficient relative to the full CMLE.

As in the linear case, in many applications the point of introducing the unobserved effect, $c_i$, is to explicitly allow unobservables to be correlated with some elements of $x_{it}$. Chamberlain (1980) allowed for correlation between $c_i$ and $x_i$ by assuming a conditional normal distribution with linear expectation and constant variance. A Mundlak (1978) version of Chamberlain’s assumption is

$$c_i \mid x_i \sim \mathrm{Normal}(\psi + \bar{x}_i\xi, \sigma_a^2), \tag{15.73}$$

where $\bar{x}_i$ is the average of $x_{it}$, $t = 1, \ldots, T$ and $\sigma_a^2$ is the variance of $a_i$ in the equation $c_i = \psi + \bar{x}_i\xi + a_i$. (In other words, $\sigma_a^2$ is the conditional variance of $c_i$, which is assumed not to depend on $x_i$.) Chamberlain (1980) allowed more generality by having $x_i$, the vector of all explanatory variables across all time periods, in place of $\bar{x}_i$. We will work with assumption (15.73), as it conserves on parameters; the more general model requires a simple notational change. Chamberlain (1980) called model (15.66) under assumptions (15.65) and (15.73) a random effects probit model, so we refer to the model as Chamberlain’s correlated random effects probit model. While assumption (15.73) is restrictive in that it specifies a distribution for $c_i$ given $x_i$, it at least allows for some dependence between $c_i$ and $x_i$.

As in the linear case, we can only estimate the effects of time-varying elements in $x_{it}$. In particular, $x_{it}$ should no longer contain a constant, as that would be indistinguishable from $\psi$ in assumption (15.73). If our original model contains a time-constant explanatory variable, say $w_i$, it can be included among the explanatory variables, but we cannot distinguish its effect from $c_i$ unless we assume that the coefficient for $w_i$ in $\xi$ is zero. (That is, unless we assume that $c_i$ is partially uncorrelated with $w_i$.) Time dummies, which do not vary across $i$, are omitted from $\bar{x}_i$.

If assumptions (15.65), (15.66), (15.67), and (15.73) hold, estimation of $\beta$, $\psi$, $\xi$, and $\sigma_a^2$ is straightforward because we can write the latent variable as $y_{it}^* = \psi + x_{it}\beta + \bar{x}_i\xi + a_i + e_{it}$, where the $e_{it}$ are independent $\mathrm{Normal}(0, 1)$ variates (conditional on $(x_i, a_i)$), and $a_i \mid x_i \sim \mathrm{Normal}(0, \sigma_a^2)$. In other words, by adding $\bar{x}_i$ to the equation for each time period, we arrive at a traditional RE probit model. (The variance we estimate is $\sigma_a^2$ rather than $\sigma_c^2$, but, as we will see, this suits our purposes nicely.) Adding $\bar{x}_i$ as a set of controls for unobserved heterogeneity is very intuitive: we are estimating the effect of changing $x_{itj}$ but holding the time average fixed. A test of the usual RE probit model is easily obtained as a test of $H_0 : \xi = 0$. Estimation can be carried out using standard RE probit software. Given estimates of $\psi$ and $\xi$, we can estimate $E(c_i) = \psi + E(\bar{x}_i)\xi$ by $\hat{\mu}_c \equiv \hat{\psi} + \bar{x}\hat{\xi}$, where $\bar{x}$ is the sample average of $\bar{x}_i$. Therefore, for any vector $x_t$, we can estimate the response probability at $E(c_i)$ as $\Phi(\hat{\mu}_c + x_t\hat{\beta})$. Taking differences or derivatives (with respect to the elements of $x_t$) allows us to estimate the partial effects on the response probabilities for any value of $x_t$. Further, we can estimate $\sigma_c^2$ by using the relationship $\sigma_c^2 = \xi'\,\mathrm{Var}(\bar{x}_i)\,\xi + \sigma_a^2$, and so $\hat{\sigma}_c^2 \equiv \hat{\xi}'\left[N^{-1}\sum_{i=1}^{N}(\bar{x}_i - \bar{x})'(\bar{x}_i - \bar{x})\right]\hat{\xi} + \hat{\sigma}_a^2$ is consistent for $\sigma_c^2$ as $N \to \infty$. Therefore, we can plug in values of $c$ that are a certain number of estimated standard deviations from $\hat{\mu}_c$, say $\hat{\mu}_c \pm \hat{\sigma}_c$. But there is a subtle point at work here: if $\xi \ne 0$, $c_i$ does not generally have an unconditional normal distribution unless $\bar{x}_i\xi$ is normally distributed; sufficient is that $\bar{x}_i$ is multivariate normal. With large $T$ and weak dependence in the time series dimension, it might be that $\bar{x}_i\xi$ has an approximate normal distribution (because it is an average across $t$). But weak dependence is not necessarily a good assumption, and even if it were, if we had a large $T$ we might proceed by treating the $c_i$ as parameters to estimate. Generally, we should not expect the distribution of $c_i$ to be normal.

If we drop assumption (15.67), we can still estimate scaled versions of $\psi$, $\beta$, and $\xi$. Under assumptions (15.65), (15.66), and (15.73) we have

$$P(y_{it} = 1 \mid x_i) = \Phi[(\psi + x_{it}\beta + \bar{x}_i\xi) \cdot (1 + \sigma_a^2)^{-1/2}] \equiv \Phi(\psi_a + x_{it}\beta_a + \bar{x}_i\xi_a), \tag{15.74}$$

where the $a$ subscript means that a parameter vector has been multiplied by $(1 + \sigma_a^2)^{-1/2}$. It follows immediately that $\psi_a$, $\beta_a$, and $\xi_a$ can be consistently estimated using a pooled probit analysis of $y_{it}$ on $1, x_{it}, \bar{x}_i$, $t = 1, \ldots, T$, $i = 1, \ldots, N$. Because the $y_{it}$ will be dependent conditional on $x_i$, inference that is robust to arbitrary time dependence is required. We can also use GEE.
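The Mundlak device amounts to a small data step plus, under the full RE probit assumptions, two moment calculations for $c_i$. The sketch below assumes a nested-list panel layout and takes the estimates `psi_hat`, `xi_hat`, and `sigma_a2_hat` as given inputs; those values are placeholders, not output from any estimation, and all function names are hypothetical.

```python
def time_averages(panel_x):
    """panel_x[i][t] is the K-vector x_it; returns xbar_i for each unit."""
    return [[sum(x[k] for x in unit) / len(unit) for k in range(len(unit[0]))]
            for unit in panel_x]

def mundlak_rows(panel_y, panel_x):
    """Rows (y_it, [1, x_it, xbar_i]) for a pooled or RE probit."""
    xbar = time_averages(panel_x)
    rows = []
    for y_i, x_i, xb_i in zip(panel_y, panel_x, xbar):
        for y_it, x_it in zip(y_i, x_i):
            rows.append((y_it, [1.0] + list(x_it) + xb_i))
    return rows

def heterogeneity_moments(xbar, psi_hat, xi_hat, sigma_a2_hat):
    """mu_c_hat = psi_hat + xbar * xi_hat and
    sigma_c2_hat = xi_hat' Var_hat(xbar_i) xi_hat + sigma_a2_hat,
    where the quadratic form equals the sample variance of xbar_i * xi_hat."""
    index = [sum(a * b for a, b in zip(xb, xi_hat)) for xb in xbar]
    mean_index = sum(index) / len(index)
    var_index = sum((v - mean_index) ** 2 for v in index) / len(index)
    return psi_hat + mean_index, var_index + sigma_a2_hat

# Toy panel: N = 3 units, T = 2 periods, one covariate (made-up numbers).
panel_y = [[1, 0], [0, 1], [0, 0]]
panel_x = [[[1.0], [3.0]], [[2.0], [4.0]], [[0.0], [2.0]]]
rows = mundlak_rows(panel_y, panel_x)
mu_c_hat, sigma_c2_hat = heterogeneity_moments(
    time_averages(panel_x), psi_hat=-0.2, xi_hat=[0.5], sigma_a2_hat=0.6)
```

The time averages are computed once per unit from all $T$ periods and then repeated in every row for that unit.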

Conveniently, once we have estimated $\psi_a$, $\beta_a$, and $\xi_a$, we can estimate the APEs. (We could apply the results from Section 2.2.5 here, but a direct argument is instructive.) To see how, we need to average $P(y_{it} = 1 \mid x_{it} = x^o, c_i)$ across the distribution of $c_i$; that is, we need to find $E[P(y_{it} = 1 \mid x_{it} = x^o, c_i)] = E[\Phi(x^o\beta + c_i)]$ for any given value $x^o$ of the explanatory variables. (In what follows, $x^o$ is a nonrandom vector of numbers that we choose as interesting values of the explanatory variables. For emphasis, we include an $i$ subscript on the random variables appearing in the expectations.) Writing $c_i = \psi + \bar{x}_i\xi + a_i$ and using iterated expectations, we have $E[\Phi(\psi + x^o\beta + \bar{x}_i\xi + a_i)] = E[E\{\Phi(\psi + x^o\beta + \bar{x}_i\xi + a_i) \mid x_i\}]$ [where the first expectation is with respect to $(x_i, a_i)$]. Using the same argument from Section 15.7.1, $E[\Phi(\psi + x^o\beta + \bar{x}_i\xi + a_i) \mid x_i] = \Phi[\{\psi + x^o\beta + \bar{x}_i\xi\} \cdot (1 + \sigma_a^2)^{-1/2}] = \Phi(\psi_a + x^o\beta_a + \bar{x}_i\xi_a)$, and so

$$E[\Phi(x^o\beta + c_i)] = E[\Phi(\psi_a + x^o\beta_a + \bar{x}_i\xi_a)]. \tag{15.75}$$

Because the only random variable in the right-hand-side expectation is $\bar{x}_i$, a consistent estimator of the right-hand side of equation (15.75) is simply

$$N^{-1} \sum_{i=1}^{N} \Phi(\hat{\psi}_a + x^o\hat{\beta}_a + \bar{x}_i\hat{\xi}_a). \tag{15.76}$$

APEs can be estimated by evaluating expression (15.76) at two different values for $x^o$ and forming the difference, or, for continuous variable $x_j$, by using the average across $i$ of $\hat{\beta}_{aj}\phi(\hat{\psi}_a + x^o\hat{\beta}_a + \bar{x}_i\hat{\xi}_a)$ to get the approximate APE of a one-unit increase in $x_j$. See also Chamberlain (1984, equation (3.4)). If we use Chamberlain’s more general version of assumption (15.73), $x_i$ replaces $\bar{x}_i$ everywhere.
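Expression (15.76) and the two kinds of APEs built from it can be sketched as follows. The scaled coefficients and the time averages below are invented for the example, and the function names are hypothetical; in an application the coefficients would come from pooled probit or GEE on $1, x_{it}, \bar{x}_i$.

```python
import math

def norm_cdf(a):
    return 0.5 * (1.0 + math.erf(a / math.sqrt(2.0)))

def norm_pdf(a):
    return math.exp(-0.5 * a * a) / math.sqrt(2.0 * math.pi)

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def avg_response(x_o, xbar, psi_a, beta_a, xi_a):
    """Expression (15.76): average over i of
    Phi(psi_a + x_o beta_a + xbar_i xi_a)."""
    base = psi_a + dot(x_o, beta_a)
    return sum(norm_cdf(base + dot(xb, xi_a)) for xb in xbar) / len(xbar)

def ape_continuous(j, x_o, xbar, psi_a, beta_a, xi_a):
    """Average over i of beta_aj * phi(psi_a + x_o beta_a + xbar_i xi_a)."""
    base = psi_a + dot(x_o, beta_a)
    dens = sum(norm_pdf(base + dot(xb, xi_a)) for xb in xbar) / len(xbar)
    return beta_a[j] * dens

# Made-up time averages for N = 4 units and made-up scaled coefficients.
xbar = [[0.2, 1.0], [0.5, 0.0], [-0.1, 1.0], [0.9, 0.0]]
psi_a, beta_a, xi_a = -0.3, [0.7, 0.4], [0.5, -0.2]

# APE of switching the second (binary) covariate from 0 to 1 at x_1 = 0.4.
ape_dummy = (avg_response([0.4, 1.0], xbar, psi_a, beta_a, xi_a)
             - avg_response([0.4, 0.0], xbar, psi_a, beta_a, xi_a))
# APE of the first (continuous) covariate at x_o = (0.4, 1.0).
ape_x0 = ape_continuous(0, [0.4, 1.0], xbar, psi_a, beta_a, xi_a)
```

Only $\bar{x}_i$ varies across the terms being averaged; $x^o$ is held at the chosen values throughout.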

Our focus on the APEs raises an interesting question: How does treating the $c_i$'s as parameters to estimate (in a “fixed effects probit” analysis) affect estimation of the APEs? Given $\hat{c}_i$, $i = 1, \ldots, N$ and $\hat{\beta}$, the APEs could be based on $N^{-1}\sum_{i=1}^{N} \Phi(\hat{c}_i + x^o\hat{\beta})$. Even though $\hat{\beta}$ does not consistently estimate $\beta$ and the $\hat{c}_i$ are estimates of the incidental parameters, it could be that the resulting estimates of the APEs have reasonable properties. In fact, recent progress has been made on this question. Fernández-Val (2009) contains theoretical results showing that the inconsistency in APEs constructed using the $\hat{c}_i$ and the accompanying $\hat{\beta}$ is on the order of $T^{-1}$ (and, if there is no heterogeneity, the inconsistency is only $O(T^{-2})$). In simulations, Hahn and Newey (2004) found small biases in the usual probit APEs; they suggested bias-corrected estimators that have inconsistency on the order of $T^{-2}$. In related work, Greene (2004), using simulations, found that for even moderate $T$ (say, $T > 5$), the bias in the partial effects evaluated at the average (as opposed to the average partial effect) can be quite small. All of these papers maintain the conditional independence assumption (15.67) and, in addition, the assumption that the covariates are independent, identically distributed across $t$. Hopefully, the findings carry over to more realistic scenarios that allow time dependence in the covariates as well as violations of (15.67).

Under assumptions (15.65), (15.66), and the more general version of assumption (15.73), Chamberlain (1980) suggested a minimum distance approach analogous to the linear case (see Section 14.6.2). Namely, obtain $\hat{\pi}_t$ for each $t$ by running a cross-sectional probit of $y_{it}$ on $1, x_i$, $i = 1, \ldots, N$. The mapping from the structural parameters $\theta_a \equiv (\psi_a, \beta_a', \xi_a')'$ to the vector $\pi$ is exactly as in the linear case (see Section 14.6.2). The variance matrix estimator for $\hat{\pi}$ is obtained by pooling all $T$ probits and computing the robust variance matrix estimator in equation (13.53), with $\hat{\theta}$ replaced by $\hat{\pi}$; see also Chamberlain (1984, Section 4.5). The minimum distance approach leads to a straightforward test of $H_0 : \xi_a = 0$, which is a test of assumption (15.69) that does not impose assumption (15.67).

Strict exogeneity of the covariates conditional on $c_i$ is critical for the previous analysis. As mentioned earlier, this assumption rules out lagged dependent variables, a case we consider explicitly in Section 15.8.4. But there are other cases where strict exogeneity is questionable. For example, suppose that $y_{it}$ is an employment indicator for person $i$ in period $t$ and $w_{it}$ is a measure of recent arrests. It is possible that whether someone is employed in this period has an effect on future arrests. If so, then shocks that affect employment status could be correlated with future arrests, and such correlation would violate strict exogeneity. Whether this situation is empirically important is largely unknown.

On the one hand, correcting for an explanatory variable that is not strictly exogenous is quite difficult in nonlinear models; Wooldridge (2000) suggests one possible approach. On the other hand, obtaining a test of strict exogeneity is fairly easy. Let $w_{it}$ denote a $1 \times G$ subset of $x_{it}$ that we suspect of failing the strict exogeneity requirement. Then a simple test is to add $w_{i,t+1}$ as an additional set of covariates; under the null hypothesis, $w_{i,t+1}$ should be insignificant. In implementing this test, we can use either RE probit or pooled probit, where, in either case, we lose the last time period. (In the pooled probit case, we should use a fully robust Wald or score test.) We should still obtain $\bar{x}_i$ from all $T$ time periods, as the test is either based on the distribution of $(y_{i1}, \ldots, y_{i,T-1})$ given $(x_{i1}, \ldots, x_{iT})$ (RE probit) or on the marginal distributions of $y_{it}$ given $(x_{i1}, \ldots, x_{iT})$, $t = 1, \ldots, T - 1$ (pooled probit). If the test does not reject, it provides at least some justification for the strict exogeneity assumption.
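The data step for this test, adding the lead $w_{i,t+1}$ and dropping the last period, might look like the following. The function name, the nested-list layout, and the choice of which column plays the role of $w_{it}$ are assumptions for illustration; the probit estimation and the robust test itself are not shown.

```python
def add_lead(panel, w_index):
    """Return rows (y_it, x_it + w_{i,t+1}) for t = 1, ..., T-1.
    panel[i] is a list of (y_it, x_it) pairs; w_index lists the positions
    in x_it of the variables suspected of violating strict exogeneity."""
    rows = []
    for unit in panel:
        for t in range(len(unit) - 1):      # the last period is lost
            y, x = unit[t]
            _, x_next = unit[t + 1]
            rows.append((y, list(x) + [x_next[k] for k in w_index]))
    return rows

# Toy panel with N = 2, T = 3; column 1 of x_it is the suspect variable.
panel = [
    [(1, [1.0, 0.5]), (0, [1.0, -0.3]), (1, [1.0, 1.1])],
    [(0, [1.0, -1.0]), (0, [1.0, 0.2]), (1, [1.0, 0.7])],
]
with_lead = add_lead(panel, w_index=[1])
```

If time averages are also included as regressors, they should still be computed from all $T$ periods before the last period is dropped, as noted in the text.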


15.8.3 Unobserved Effects Logit Models under Strict Exogeneity 


The unobserved effects probit model of the previous subsection has a logit counterpart. In the leading case, the heterogeneity, $c_i$, is still assumed to satisfy (15.69), but we replace assumption (15.66) with the response probability

$$P(y_{it} = 1 \mid x_{it}, c_i) = \Lambda(x_{it}\beta + c_i), \quad t = 1, \ldots, T, \tag{15.77}$$

where $\Lambda(\cdot)$ is the logistic function. If we maintain the full set of assumptions, given by the strict exogeneity condition (15.65), the conditional independence assumption (15.67), the normality assumption (15.69), and the model in equation (15.77), we arrive at the random effects logit model. In the GEE literature, the RE logit model is an example of a subject-specific model (like the RE probit model).

From a computational standpoint, the RE logit model is less desirable than the RE
probit model. Even if we focus on a pooled method under assumptions (15.69) and
(15.77), estimation is complicated because the response probability obtained by
integrating out $c_i$,

$$P(y_{it} = 1 \mid \mathbf{x}_i) = \int_{-\infty}^{\infty} \Lambda(\mathbf{x}_{it}\boldsymbol{\beta} + c)\,(1/\sigma_c)\,\phi(c/\sigma_c)\,dc, \tag{15.78}$$

does not have a closed form. In fact, one rarely sees equation (15.78) used as the sole
basis for estimating $\boldsymbol{\beta}$ and $\sigma_c^2$, whether using a pooled binary response method or a
GEE method under assumption (15.65). (Unlike in the probit model, $\boldsymbol{\beta}$ and $\sigma_c^2$ are
separately identified in (15.78), although that does not mean they are easy to estimate.)
Interestingly, although it seems natural to define (15.78), along with the strict
exogeneity assumption, as constituting the PA version of the SS model, that is not the
designation in the GEE literature. Instead, the PA model is specified as


$$P(y_{it} = 1 \mid \mathbf{x}_i) = P(y_{it} = 1 \mid \mathbf{x}_{it}) = \Lambda(\mathbf{x}_{it}\boldsymbol{\beta}_c), \tag{15.79}$$

where we use the “c” subscript on $\boldsymbol{\beta}_c$ to emphasize that the beta vector in equation
(15.79) cannot be the same as $\boldsymbol{\beta}$ in equation (15.77) (and to draw an analogy with
probit). In fact, (15.79) is incompatible with (15.77); in specifying (15.79), we are
abandoning the more structural model in (15.77) for the sake of expediency. In the
GEE literature more generally, a PA model is usually specified to have a convenient
functional form for the conditional mean, $E(y_{it} \mid \mathbf{x}_{it})$, without worrying about whether
it can be derived from an underlying unobserved effects model.
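
To see why (15.78) is awkward to work with, it helps to evaluate the integral numerically. The sketch below uses a plain trapezoid rule with the standard library only; the function names, grid size, and truncation width are assumptions chosen for illustration, and a production implementation would more likely use Gauss-Hermite quadrature.

```python
import math


def logistic(z):
    return 1.0 / (1.0 + math.exp(-z))


def normal_pdf(c, sigma):
    return math.exp(-0.5 * (c / sigma) ** 2) / (sigma * math.sqrt(2.0 * math.pi))


def pa_probability(index, sigma_c, n_grid=4001, width=8.0):
    """Approximate the integral in (15.78) for a given index x_it * beta.

    Trapezoid rule on [-width * sigma_c, width * sigma_c]; the tails
    beyond eight standard deviations are ignored.
    """
    lo = -width * sigma_c
    h = 2.0 * width * sigma_c / (n_grid - 1)
    total = 0.0
    for k in range(n_grid):
        c = lo + k * h
        weight = 0.5 if k in (0, n_grid - 1) else 1.0
        total += weight * logistic(index + c) * normal_pdf(c, sigma_c)
    return total * h


# Integrating out c pulls the probability toward one-half.
print(pa_probability(1.0, 2.0), logistic(1.0))
```

For a positive index the integrated probability lies strictly between one-half and $\Lambda(\text{index})$, and it is not itself a logistic function of the index. This is the sense in which a PA logit specification such as (15.79) is a different model rather than an implication of (15.77).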

If we estimate the model in (15.77) under the full set of RE logit assumptions (or
even if we just use equation (15.78)), the partial effects at interesting values of $c$ are
easier to obtain than the APEs: we just use differences or derivatives of $\Lambda(\mathbf{x}_t\boldsymbol{\beta} + c)$
with respect to elements of $\mathbf{x}_t$ and plug in interesting values of $c$, which can be
determined because $c_i$ is distributed as Normal$(0, \sigma_c^2)$. Further, we can apply the
Mundlak-Chamberlain device just as in assumption (15.73), and then we can estimate
the unconditional mean and variance of $c_i$ as in the probit case. If we want to allow
$D(c_i \mid \mathbf{x}_i)$ to depend on $\mathbf{x}_i$ but do not want to use RE logit as the underlying model,
we can adopt a strategy that extends the standard PA model strategy. Namely, we
just specify

$$P(y_{it} = 1 \mid \mathbf{x}_i) = \Lambda(\psi_a + \mathbf{x}_{it}\boldsymbol{\beta}_a + \bar{\mathbf{x}}_i\boldsymbol{\xi}_a), \qquad t = 1, \ldots, T, \tag{15.80}$$

where we use the “a” subscript to indicate a new set of parameters (while again
drawing an analogy to the probit case). Given equation (15.80), we can estimate the
parameters by pooled logit or GEE, and then, to obtain APEs, compute the ASF
$N^{-1}\sum_{i=1}^{N} \Lambda(\hat{\psi}_a + \mathbf{x}_t\hat{\boldsymbol{\beta}}_a + \bar{\mathbf{x}}_i\hat{\boldsymbol{\xi}}_a)$ for given $\mathbf{x}_t$. Although equation (15.80) cannot be
derived from (15.77), it might provide a good approximation to the APEs.

The real advantage of specifying equation (15.77) in place of (15.66) is that under
the logit specification, we can obtain a $\sqrt{N}$-consistent, asymptotically normal
estimator of $\boldsymbol{\beta}$ without any assumptions on $D(c_i \mid \mathbf{x}_i)$, provided, of course, that each
element of $\mathbf{x}_{it}$ is time varying. In addition to assumption (15.77), we maintain the strict
exogeneity assumption (15.65) and the conditional independence assumption (15.67).

How can we allow $c_i$ and $\mathbf{x}_i$ to be arbitrarily related in the unobserved effects logit
model? In the linear case we used the FE or FD transformation to eliminate $c_i$ from
the estimating equation. It turns out that a similar strategy works in the logit case,
although the argument is more subtle. What we do is find the joint distribution
of $\mathbf{y}_i = (y_{i1}, \ldots, y_{iT})'$ conditional on $\mathbf{x}_i$, $c_i$, and $n_i = \sum_{t=1}^{T} y_{it}$. It turns out that this
conditional distribution does not depend on $c_i$, so that it is also the distribution of $\mathbf{y}_i$
given $\mathbf{x}_i$ and $n_i$. Therefore, we can use standard CMLE methods to estimate $\boldsymbol{\beta}$. (The
fact that we can find a conditional distribution that does not depend on the $c_i$ is a
feature of the logit functional form. Unfortunately, the same argument does not work
for the unobserved effects probit model.)




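First consider the $T = 2$ case, where $n_i$ takes a value in $\{0, 1, 2\}$. Intuitively, the
conditional distribution of $(y_{i1}, y_{i2})'$ given $n_i$ cannot be informative for $\boldsymbol{\beta}$ when $n_i = 0$
or $n_i = 2$ because these values completely determine the outcome on $\mathbf{y}_i$. However, for
$n_i = 1$,

$$\begin{aligned}
P(y_{i2} = 1 \mid \mathbf{x}_i, c_i, n_i = 1) &= P(y_{i2} = 1, n_i = 1 \mid \mathbf{x}_i, c_i)/P(n_i = 1 \mid \mathbf{x}_i, c_i) \\
&= P(y_{i2} = 1 \mid \mathbf{x}_i, c_i)\,P(y_{i1} = 0 \mid \mathbf{x}_i, c_i)/\{P(y_{i1} = 0, y_{i2} = 1 \mid \mathbf{x}_i, c_i) \\
&\qquad + P(y_{i1} = 1, y_{i2} = 0 \mid \mathbf{x}_i, c_i)\} \\
&= \Lambda(\mathbf{x}_{i2}\boldsymbol{\beta} + c_i)[1 - \Lambda(\mathbf{x}_{i1}\boldsymbol{\beta} + c_i)]/\{[1 - \Lambda(\mathbf{x}_{i1}\boldsymbol{\beta} + c_i)]\Lambda(\mathbf{x}_{i2}\boldsymbol{\beta} + c_i) \\
&\qquad + \Lambda(\mathbf{x}_{i1}\boldsymbol{\beta} + c_i)[1 - \Lambda(\mathbf{x}_{i2}\boldsymbol{\beta} + c_i)]\} = \Lambda[(\mathbf{x}_{i2} - \mathbf{x}_{i1})\boldsymbol{\beta}].
\end{aligned}$$

Similarly, $P(y_{i1} = 1 \mid \mathbf{x}_i, c_i, n_i = 1) = \Lambda[-(\mathbf{x}_{i2} - \mathbf{x}_{i1})\boldsymbol{\beta}] = 1 - \Lambda[(\mathbf{x}_{i2} - \mathbf{x}_{i1})\boldsymbol{\beta}]$. The
conditional log likelihood for observation $i$ is

$$\ell_i(\boldsymbol{\beta}) = 1[n_i = 1]\big(w_i \log \Lambda[(\mathbf{x}_{i2} - \mathbf{x}_{i1})\boldsymbol{\beta}] + (1 - w_i)\log\{1 - \Lambda[(\mathbf{x}_{i2} - \mathbf{x}_{i1})\boldsymbol{\beta}]\}\big), \tag{15.81}$$

where $w_i = 1$ if $(y_{i1} = 0, y_{i2} = 1)$ and $w_i = 0$ if $(y_{i1} = 1, y_{i2} = 0)$. The CMLE is
obtained by maximizing the sum of the $\ell_i(\boldsymbol{\beta})$ across $i$. The indicator function $1[n_i = 1]$
selects out the observations for which $n_i = 1$; as stated earlier, observations for
which $n_i = 0$ or $n_i = 2$ do not contribute to the log likelihood. Interestingly, equation
(15.81) is just a standard cross-sectional logit of $w_i$ on $(\mathbf{x}_{i2} - \mathbf{x}_{i1})$ using the
observations for which $n_i = 1$. (This approach is analogous to differencing in the linear case
with $T = 2$.)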
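
The cancellation of $c_i$ in the $T = 2$ case is easy to confirm numerically. The sketch below builds the conditional probability directly from the joint probabilities and also codes the log likelihood (15.81) for a scalar covariate; the function names and the tuple layout `(y1, y2, x1, x2)` are assumptions made for the illustration.

```python
import math


def logistic(z):
    return 1.0 / (1.0 + math.exp(-z))


def prob_y2_given_n1(x1b, x2b, c):
    """P(y_i2 = 1 | x_i, c_i, n_i = 1) built from the joint probabilities."""
    p1, p2 = logistic(x1b + c), logistic(x2b + c)
    num = (1.0 - p1) * p2                     # outcome (y_i1, y_i2) = (0, 1)
    den = (1.0 - p1) * p2 + p1 * (1.0 - p2)   # all outcomes with n_i = 1
    return num / den


def cond_loglik_T2(beta, data):
    """Sum of (15.81) over units; data holds (y1, y2, x1, x2), scalar x."""
    total = 0.0
    for y1, y2, x1, x2 in data:
        if y1 + y2 != 1:
            continue  # n_i = 0 or n_i = 2 contributes nothing
        w = 1 if (y1, y2) == (0, 1) else 0
        p = logistic((x2 - x1) * beta)
        total += w * math.log(p) + (1 - w) * math.log(1.0 - p)
    return total


# The conditional probability is the same for every value of c.
for c in (-3.0, 0.0, 5.0):
    print(c, prob_y2_given_n1(0.2, 1.0, c), logistic(1.0 - 0.2))
```

Maximizing `cond_loglik_T2` over `beta` is the same problem as fitting a cross-sectional logit of $w_i$ on the differenced covariate among units with $n_i = 1$.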

The CMLE from equation (15.81) is often called the fixed effects logit estimator
and sometimes called the conditional logit estimator. We must emphasize that the FE
logit estimator does not arise by treating the $c_i$ as parameters to be estimated along
with $\boldsymbol{\beta}$. (This convention is confusing, as the FE probit estimator does estimate the $c_i$
along with $\boldsymbol{\beta}$.) As shown recently by Abrevaya (1997), the MLE of $\boldsymbol{\beta}$ that is obtained
by maximizing the log likelihood over $\boldsymbol{\beta}$ and the $c_i$ has probability limit $2\boldsymbol{\beta}$. (This
finding extends a simple example due to Andersen, 1970; see also Hsiao, 1986,
Section 7.3.)

Sometimes the CMLE is described as “conditioning on the unobserved effects in
the sample.” This description is misleading. What we have done is find a conditional
density (which describes the subpopulation with $n_i = 1$) that depends only
on observable data and the parameter $\boldsymbol{\beta}$. One should not expend energy worrying
about the loss of observations for which $y_{i1} = y_{i2}$. Because $D(c_i \mid \mathbf{x}_i)$ is unrestricted, $c_i$
is free to vary as much as required to make $P(y_{it} = 1 \mid \mathbf{x}_i, c_i)$ arbitrarily close to one (if
$y_{it} = 1$, $t = 1, 2$) or arbitrarily close to zero (if $y_{it} = 0$, $t = 1, 2$). When $y_{it}$ does not
change across $t$, $c_i$ can adjust to make both observed outcomes occur with certainty.
Clearly, such data points contain no information for estimating $\boldsymbol{\beta}$, and so they should
drop out of the estimation.

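For general $T$ the log likelihood is more complicated, but it is tractable. First,

$$\begin{aligned}
&P(y_{i1} = y_1, \ldots, y_{iT} = y_T \mid \mathbf{x}_i, c_i, n_i = n) \\
&\qquad = P(y_{i1} = y_1, \ldots, y_{iT} = y_T \mid \mathbf{x}_i, c_i)/P(n_i = n \mid \mathbf{x}_i, c_i),
\end{aligned} \tag{15.82}$$

and the numerator factors as $P(y_{i1} = y_1 \mid \mathbf{x}_i, c_i) \cdots P(y_{iT} = y_T \mid \mathbf{x}_i, c_i)$ by the
conditional independence assumption. The denominator is the complicated part, but it is
easy to describe: $P(n_i = n \mid \mathbf{x}_i, c_i)$ is the sum of the probabilities of all possible
outcomes of $\mathbf{y}_i$ such that $n_i = n$. Using the specific form of the logit function we can write

$$\ell_i(\boldsymbol{\beta}) = \log\left\{\exp\left(\sum_{t=1}^{T} y_{it}\mathbf{x}_{it}\boldsymbol{\beta}\right)\left[\sum_{\mathbf{a} \in R_i} \exp\left(\sum_{t=1}^{T} a_t\mathbf{x}_{it}\boldsymbol{\beta}\right)\right]^{-1}\right\}, \tag{15.83}$$

where $R_i$ is the subset of $\mathbb{R}^T$ defined as $\{\mathbf{a} \in \mathbb{R}^T : a_t \in \{0, 1\} \text{ and } \sum_{t=1}^{T} a_t = n_i\}$. The
log likelihood summed across $i$ can be used to obtain a $\sqrt{N}$-asymptotically normal
estimator of $\boldsymbol{\beta}$, and all inference follows from conditional MLE theory. Observations
for which equation (15.82) is zero or unity, and which therefore do not depend on
$\boldsymbol{\beta}$, drop out of $\mathcal{L}(\boldsymbol{\beta})$. See Chamberlain (1984).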
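
For small $T$, the conditional log likelihood (15.83) can be coded by enumerating the set $R_i$ directly. The sketch below does this by brute force; the function name and the list-of-rows layout for the covariates are assumptions for the example, and practical software typically uses a recursive algorithm for the denominator because the number of terms grows combinatorially in $T$.

```python
import math
from itertools import combinations


def cond_loglik_i(beta, y, x):
    """Conditional log likelihood (15.83) for one unit.

    y: list of T binary outcomes; x: list of T covariate rows (lists);
    beta: list of coefficients. R_i is enumerated by brute force.
    """
    T = len(y)
    n = sum(y)
    idx = [sum(xk * bk for xk, bk in zip(x[t], beta)) for t in range(T)]
    numer = sum(y[t] * idx[t] for t in range(T))
    # Each element of R_i corresponds to a choice of n periods with a_t = 1.
    denom = sum(
        math.exp(sum(idx[t] for t in ones))
        for ones in combinations(range(T), n)
    )
    return numer - math.log(denom)


# With T = 2 and n_i = 1 this reduces to log Lambda[(x_i2 - x_i1) * beta].
print(cond_loglik_i([0.7], [0, 1], [[1.0], [3.0]]))
# Units with n_i = 0 or n_i = T contribute zero.
print(cond_loglik_i([0.7], [0, 0, 0], [[1.0], [2.0], [3.0]]))
```

Because units with $n_i = 0$ or $n_i = T$ return zero for every `beta`, they can be dropped before optimization without changing the estimator.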

The FE logit estimator $\hat{\boldsymbol{\beta}}$ immediately gives us the effect of each element of $\mathbf{x}_t$ on
the log-odds ratio, $\log\{\Lambda(\mathbf{x}_t\boldsymbol{\beta} + c)/[1 - \Lambda(\mathbf{x}_t\boldsymbol{\beta} + c)]\} = \mathbf{x}_t\boldsymbol{\beta} + c$. Unfortunately, we
cannot estimate the partial effects on the response probabilities unless we plug in a
value for $c$. Because the distribution of $c_i$ is unrestricted (in particular, $E(c_i)$ is not
necessarily zero), it is hard to know what to plug in for $c$. In addition, we cannot
estimate APEs, as doing so would require finding $E[\Lambda(\mathbf{x}_t\boldsymbol{\beta} + c_i)]$, a task that apparently
requires specifying a distribution for $c_i$.

The conditional logit approach also has the drawback of apparently requiring the
conditional independence assumption (15.67) for consistency. As we saw in Section
15.8.2, if we are willing to make the normality assumption (15.73), the probit approach
allows unrestricted serial dependence in $y_{it}$, even after conditioning on $\mathbf{x}_i$ and $c_i$. This
possibility may be especially important when several time periods are available.

Now that we have covered all of the leading estimation methods for unobserved
effects models with strictly exogenous explanatory variables, we can provide an
example comparing the different approaches. We use a panel data set on the labor force
participation of married women from Chay and Hyslop (2001).

Table 15.3
Panel Data Models for Married Women’s Labor Force Participation

| Variable | (1) Linear, fixed effects: coefficient | (2) Probit, pooled MLE: coefficient | (2) APE | (3) Chamberlain’s RE probit, pooled MLE: coefficient | (3) APE | (4) Chamberlain’s RE probit, MLE: coefficient | (4) APE | (5) FE logit, MLE: coefficient |
|---|---|---|---|---|---|---|---|---|
| kids | -.0389 (.0092) | -.199 (.015) | -.0660 (.0048) | -.117 (.027) | -.0389 (.0085) | -.317 (.062) | -.0403 (not computed) | -.644 (.125) |
| lhinc | -.0089 (.0046) | -.211 (.024) | -.0701 (.0079) | -.029 (.014) | -.0095 (.0048) | -.078 (.041) | -.0099 (not computed) | -.184 (.083) |
| kids, time average | | | | -.086 (.031) | | -.210 (.071) | | |
| lhinc, time average | | | | -.250 (.035) | | -.646 (.079) | | |
| $(1 + \hat{\sigma}_a^2)^{-1/2}$ | | | | | | .387 | | |
| Log likelihood | | -16,556.67 | | -16,516.44 | | -8,990.09 | | -2,003.42 |
| Number of women | 5,663 | 5,663 | | 5,663 | | 5,663 | | 1,055 |

Standard errors are in parentheses next to all coefficients or APEs. For the linear model, pooled probit, and
Chamberlain’s RE probit estimated by pooled MLE, the standard errors are robust to arbitrary serial
correlation.

All models include a full set of period dummy variables (unreported).

Columns (2), (3), and (4) also include the time-constant variables educ, black, age, and age² (unreported).
Because of prohibitive computation times, bootstrap standard errors were not computed for the RE probit
APEs estimated by MLE.


Example 15.5 (Panel Data Models for Women’s Labor Force Participation): The
data set LFP.RAW contains data on $N = 5{,}663$ married women over $T = 5$ periods,
where the periods are spaced four months apart. The response variable is $lfp_{it}$, a labor
force participation indicator. The key explanatory variables are $kids_{it}$ (number of
children under 18) and $lhinc_{it} = \log(hinc_{it})$ (where husband’s income, $hinc_{it}$, is in
dollars per month and is positive for all $i$ and $t$). We also include the time-constant
variables educ (years of schooling in the first period), black (a binary race indicator),
age (age in the first period), and age²; these drop out of the linear FE and FE logit
estimation.

The findings in Table 15.3 reveal a consistent pattern: allowing unobserved
heterogeneity to be correlated with $kids_{it}$ and $lhinc_{it}$ has important effects on the estimated
APEs. Estimating the LPM by FE gives estimated coefficients of roughly -.039 and
-.009 on the $kids_{it}$ and $lhinc_{it}$ variables, respectively, with the coefficient on $lhinc_{it}$
being marginally statistically significant. Each child is estimated to reduce the labor
force participation probability by about .039, while a 10% increase in a husband’s
income lowers the probability by only .0009. If we use probit and assume $c_i$ is
independent of $\mathbf{x}_i$, the APEs are much larger in magnitude, especially on the income variable,
which is almost 10 times the LPM coefficient. Could this difference be due to the different
functional forms? Column (3) makes it clear that the difference in APEs between
columns (1) and (2) is due to the restriction in (2) that $c_i$ is independent of $\mathbf{x}_i$. When
we use the Chamberlain-Mundlak device and use pooled probit, we obtain APEs that
are very similar to the LPM estimates. In fact, to four decimal places, the APE
estimates on $kids_{it}$ are identical. Plus, we can see directly that the coefficients on the time
averages are very statistically significant and practically very large; each is much
larger than the corresponding coefficient on the time-varying variable.

Using full MLE, rather than PMLE, to estimate Chamberlain’s model does not
change any conclusions. When we multiply the MLE coefficients by the scale factor
.387, the results are very similar to the pooled MLE coefficients (-.1227 for $kids_{it}$ and
-.030 for $lhinc_{it}$). In addition, the APE estimates are very similar to the pooled
estimation; interestingly, the bootstrap standard errors for the APEs are actually
somewhat higher than for the pooled probit estimates.

Finally, column (5) contains the estimates from FE logit. As we discussed, the
coefficient magnitudes are difficult to interpret. But the relative size is .644/.184 = 3.50,
which is not terribly different from, say, the ratio of coefficients using the pooled
MLE estimates of Chamberlain’s model, .117/.029 = 4.03. But if we do not control
for the time averages, the estimated ratios are much different, .199/.211 = .94.
Because Chamberlain’s approach yields APEs, and the estimates in column (3) allow for
arbitrary serial dependence, it seems sensible to rely on these estimates (assuming, of
course, that we believe $kids_{it}$ and $lhinc_{it}$ are strictly exogenous conditional on $c_i$).
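
The ratio and scaling comparisons in this example are simple arithmetic and can be reproduced in a few lines. The numbers below are the magnitudes quoted in the discussion above (signs are omitted because only ratios and products are compared); the dictionary names are just labels chosen for the illustration.

```python
# Coefficient magnitudes quoted in the discussion of Table 15.3.
fe_logit = {"kids": 0.644, "lhinc": 0.184}
cham_pooled = {"kids": 0.117, "lhinc": 0.029}
probit_pooled = {"kids": 0.199, "lhinc": 0.211}
cham_mle = {"kids": 0.317, "lhinc": 0.078}
scale = 0.387  # reported scale factor for the full MLE estimates

# Ratio of the kids coefficient to the lhinc coefficient in each model.
ratios = {
    "FE logit": round(fe_logit["kids"] / fe_logit["lhinc"], 2),
    "Chamberlain, pooled MLE": round(cham_pooled["kids"] / cham_pooled["lhinc"], 2),
    "Pooled probit": round(probit_pooled["kids"] / probit_pooled["lhinc"], 2),
}
print(ratios)

# Scaling the full MLE coefficients makes them comparable to pooled MLE.
scaled = {name: round(scale * coef, 4) for name, coef in cham_mle.items()}
print(scaled)
```

The scaled full MLE magnitudes come out close to the pooled MLE coefficients of .117 and .029, which is the comparison made in the text.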


Because we have covered several different possibilities for modeling and estimation
in the context of unobserved effects with binary responses, it is helpful to have a
simple summary of the strengths and weaknesses of each approach. Table 15.4
contains such a summary. Each method is evaluated by answers to five questions: (1) Is
the response probability contained in the unit interval? (2) Is the distribution of the
heterogeneity, given the covariates, restricted? (3) Is serial dependence allowed after
accounting for $c_i$? (4) Are the partial effects at the mean heterogeneity identified? (5)
Are the APEs identified?

Table 15.4 reveals a simple but important point that is easily missed: no procedure
strictly dominates the others. When deciding on a method or methods, one needs to
determine how important each factor is likely to be. For some features, this is easy.
For example, hopefully we know whether having consistent parameter estimates is
enough for our purposes or whether we want to estimate the magnitudes of effects.
Unfortunately, it is difficult to decide issues that are essentially empirical in nature,
such as how important functional form restrictions are on $P(y_{it} = 1 \mid \mathbf{x}_{it}, c_i)$ or
$D(c_i \mid \mathbf{x}_i)$.

Table 15.4
Summary of Features of Models and Estimation Methods for Unobserved Effects Binary Response Models

| Model, Estimation Method | $P(y_{it} = 1 \mid \mathbf{x}_{it}, c_i)$ Bounded in (0,1)? | Restricts $D(c_i \mid \mathbf{x}_i)$? | Idiosyncratic Serial Dependence? | Partial Effects at $E(c_i)$? | APEs? |
|---|---|---|---|---|---|
| RE probit, MLE | Yes | Yes (independence, normal) | No | Yes | Yes |
| RE probit, pooled MLE | Yes | Yes (independence, normal) | Yes | No | Yes |
| RE probit, GEE | Yes | Yes (independence, normal) | Yes | No | Yes |
| Chamberlain’s RE probit, MLE | Yes | Yes (linear mean, normal) | No | Yes | Yes |
| Chamberlain’s RE probit, pooled MLE | Yes | Yes (linear mean, normal) | Yes | No | Yes |
| Chamberlain’s RE probit, GEE | Yes | Yes (linear mean, normal) | Yes | No | Yes |
| LPM, within | No | No | Yes | Yes | Yes |
| FE logit, MLE | Yes | No | No | No | No |


15.8.4 Dynamic Unobserved Effects Models 


Dynamic models that also contain unobserved effects are important in testing theories 
and evaluating policies. Here we cover one class of models that illustrates the impor- 
tant points for general dynamic models and is of considerable interest in its own 
right. Our treatment follows Wooldridge (2005b). 

Suppose we date our observations starting at $t = 0$, so that $y_{i0}$ is the first
observation on $y$. For $t = 1, \ldots, T$ we are interested in the dynamic unobserved effects
model

$$P(y_{it} = 1 \mid y_{i,t-1}, \ldots, y_{i0}, \mathbf{z}_i, c_i) = G(\mathbf{z}_{it}\boldsymbol{\delta} + \rho y_{i,t-1} + c_i), \tag{15.84}$$

where $\mathbf{z}_{it}$ is a vector of contemporaneous explanatory variables, $\mathbf{z}_i = (\mathbf{z}_{i1}, \ldots, \mathbf{z}_{iT})$,
and $G$ can be the probit or logit function. There are several important points about
this model. First, the $\mathbf{z}_{it}$ are assumed to satisfy a strict exogeneity assumption
(conditional on $c_i$), since $\mathbf{z}_i$ appears in the conditioning set on the left-hand side of
equation (15.84), but only $\mathbf{z}_{it}$ appears on the right-hand side. Second, the probability of
success at time $t$ is allowed to depend on the outcome in $t - 1$ as well as unobserved
heterogeneity, $c_i$. We saw the linear version in Section 11.6.2. Of particular interest is
the hypothesis $H_0\colon \rho = 0$. Under this null, the response probability at time $t$ does not
depend on past outcomes once $c_i$ (and $\mathbf{z}_i$) have been controlled for. Even if $\rho = 0$,
$P(y_{it} = 1 \mid y_{i,t-1}, \mathbf{z}_i) \neq P(y_{it} = 1 \mid \mathbf{z}_i)$ owing to the presence of $c_i$. But economists are
interested in whether there is state dependence (that is, $\rho \neq 0$ in equation (15.84))
after controlling for the unobserved heterogeneity, $c_i$.

We might also be interested in the effects of $\mathbf{z}_t$, as it may contain policy variables.
Then, equation (15.84) simply captures the fact that, in addition to an unobserved
effect, behavior may depend on past observed behavior.

How can we estimate $\boldsymbol{\delta}$ and $\rho$ in equation (15.84), in addition to quantities such as
APEs? First, we can always write

$$\begin{aligned}
f(y_1, y_2, \ldots, y_T \mid y_0, \mathbf{z}, c; \boldsymbol{\beta}) &= \prod_{t=1}^{T} f(y_t \mid y_{t-1}, \ldots, y_1, y_0, \mathbf{z}_t, c; \boldsymbol{\beta}) \\
&= \prod_{t=1}^{T} G(\mathbf{z}_t\boldsymbol{\delta} + \rho y_{t-1} + c)^{y_t}[1 - G(\mathbf{z}_t\boldsymbol{\delta} + \rho y_{t-1} + c)]^{1 - y_t}.
\end{aligned} \tag{15.85}$$

With fixed-$T$ asymptotics, this density, because of the unobserved effect $c$, does not
allow us to construct a log-likelihood function that can be used to estimate $\boldsymbol{\beta}$
consistently. Just as in the case with strictly exogenous explanatory variables, treating
the $c_i$ as parameters to be estimated does not result in consistent estimators of $\boldsymbol{\delta}$ and
$\rho$ as $N \to \infty$. In fact, the simulations in Heckman (1981) show that the incidental
parameters problem is even more severe in dynamic models. What we should do is
integrate out the unobserved effect $c$, as we discussed generally in Section 13.9.2.

Our need to integrate $c$ out of the distribution raises the issue of how we treat the
initial observations, $y_{i0}$; this is usually called the initial conditions problem. One
possibility is to treat each $y_{i0}$ as a nonstochastic starting position for each $i$. Then, if $c_i$ is
assumed to be independent of $\mathbf{z}_i$ (as in a pure RE environment), equation (15.85) can
be integrated against the density of $c$ to obtain the density of $(y_1, y_2, \ldots, y_T)$ given $\mathbf{z}$;
this density also depends on $y_0$ through $f(y_1 \mid y_0, c, \mathbf{z}_1; \boldsymbol{\beta})$. We can then apply CMLE.
Although treating the $y_{i0}$ as nonrandom simplifies estimation, it is undesirable
because it effectively means that $c_i$ and $y_{i0}$ are independent, a very strong assumption.

There seems to be confusion in the literature about when it is plausible to treat the
$y_{i0}$ as fixed, and therefore independent of $c_i$. Some have justified this assumption
when one observes the process generating $y_{it}$ from its beginning, as would happen if
we follow the employment history of a cohort of high school graduates who do not
pursue additional education, with $y_{i0}$ being an employment indicator in the initial
postgraduation period. Does it make sense to assume $y_{i0}$ and $c_i$ are independent?
Almost certainly not. Because $c_i$ contains unobserved attributes that affect $y_{it}$ in
periods $t \geq 1$, it is almost certain that an individual’s initial employment status is
related to $c_i$. In this example and most others, treating the $y_{i0}$ as independent of $c_i$ is
risky, regardless of when the process underlying the panel data actually started.

Another possibility is to first specify a density for $y_{i0}$ given $(\mathbf{z}_i, c_i)$ and to multiply
this density by equation (15.85) to obtain $f(y_0, y_1, y_2, \ldots, y_T \mid \mathbf{z}, c; \boldsymbol{\beta}, \boldsymbol{\gamma})$. Next, a
density for $c_i$ given $\mathbf{z}_i$ can be specified. Finally, $f(y_0, y_1, y_2, \ldots, y_T \mid \mathbf{z}, c; \boldsymbol{\beta}, \boldsymbol{\gamma})$ is
integrated against the density $h(c \mid \mathbf{z}; \boldsymbol{\alpha})$ to obtain the density of $(y_{i0}, y_{i1}, y_{i2}, \ldots, y_{iT})$
given $\mathbf{z}_i$. This density can then be used in an MLE analysis. The problem with this
approach is that finding the density of $y_{i0}$ given $(\mathbf{z}_i, c_i)$ is very difficult, if not
impossible, even if the process is assumed to be in equilibrium. For discussion, see Hsiao
(2003, Section 7.5).

Heckman (1981) suggests approximating the conditional density of $y_{i0}$ given $(\mathbf{z}_i, c_i)$
and then specifying a density for $c_i$ given $\mathbf{z}_i$. For example, we might assume that $y_{i0}$
follows a probit model with success probability $\Phi(\eta + \mathbf{z}_i\boldsymbol{\pi} + \gamma c_i)$ and specify the
density of $c_i$ given $\mathbf{z}_i$ as normal. Once these two densities are given, they can be
multiplied by equation (15.85), and $c$ can be integrated out to approximate the density of
$(y_{i0}, y_{i1}, y_{i2}, \ldots, y_{iT})$ given $\mathbf{z}_i$; see Hsiao (2003, Section 7.5) or Wooldridge (2005b).

Heckman’s (1981) approach attempts to find or approximate the joint distribution
of $(y_{i0}, y_{i1}, y_{i2}, \ldots, y_{iT})$ given $\mathbf{z}_i$. We discussed an alternative approach in Section
13.9.2: obtain the joint distribution of $(y_{i1}, y_{i2}, \ldots, y_{iT})$ conditional on $(y_{i0}, \mathbf{z}_i)$. This
allows us to remain agnostic about the distribution of $y_{i0}$ given $(\mathbf{z}_i, c_i)$, which is the
primary source of difficulty in Heckman’s approach. If we can find the density of
$(y_{i1}, y_{i2}, \ldots, y_{iT})$ given $(y_{i0}, \mathbf{z}_i)$, in terms of $\boldsymbol{\beta}$ and other parameters, then we can use
standard CMLE methods: we are simply conditioning on $y_{i0}$ in addition to $\mathbf{z}_i$. It is
important to see that using the density of $(y_{i1}, y_{i2}, \ldots, y_{iT})$ given $(y_{i0}, \mathbf{z}_i)$ is not the
same as treating $y_{i0}$ as nonrandom. Indeed, the model with $c_i$ independent of $y_{i0}$,
given $\mathbf{z}_i$, is a special case.

To obtain $f(y_1, y_2, \ldots, y_T \mid y_{i0}, \mathbf{z}_i)$, we need to propose a density for $c_i$ given
$(y_{i0}, \mathbf{z}_i)$. This approach is very much like Chamberlain’s (1980) approach to static
probit models with unobserved effects, except that we now condition on $y_{i0}$ as well.
(Since the density of $c_i$ given $\mathbf{z}_i$ is not restricted by the specification (15.85), our
choice of the density of $c_i$ given $(y_{i0}, \mathbf{z}_i)$ is not logically restricted in any way.) Given
a density $h(c \mid y_0, \mathbf{z}; \boldsymbol{\gamma})$, which depends on a vector of parameters $\boldsymbol{\gamma}$, we have

$$f(y_1, \ldots, y_T \mid y_0, \mathbf{z}; \boldsymbol{\theta}) = \int_{-\infty}^{\infty} f(y_1, y_2, \ldots, y_T \mid y_0, \mathbf{z}, c; \boldsymbol{\beta})\,h(c \mid y_0, \mathbf{z}; \boldsymbol{\gamma})\,dc.$$

See Property CD.2 in Chapter 13. The integral can be replaced with a weighted
average if the distribution of $c$ is discrete. When $G = \Phi$ in the model (15.84), the
leading case, a very convenient choice for $h(c \mid y_0, \mathbf{z}; \boldsymbol{\gamma})$ is Normal$(\psi + \xi_0 y_{i0} + \mathbf{z}_i\boldsymbol{\xi}, \sigma_a^2)$,
which follows by writing $c_i = \psi + \xi_0 y_{i0} + \mathbf{z}_i\boldsymbol{\xi} + a_i$, where $a_i \sim$ Normal$(0, \sigma_a^2)$
and independent of $(y_{i0}, \mathbf{z}_i)$. Then we can write

$$y_{it} = 1[\psi + \mathbf{z}_{it}\boldsymbol{\delta} + \rho y_{i,t-1} + \xi_0 y_{i0} + \mathbf{z}_i\boldsymbol{\xi} + a_i + e_{it} > 0],$$

so that $y_{it}$ given $(y_{i,t-1}, \ldots, y_{i0}, \mathbf{z}_i, a_i)$ follows a probit model and $a_i$ given $(y_{i0}, \mathbf{z}_i)$ is
distributed as Normal$(0, \sigma_a^2)$. Therefore, the density of $(y_{i1}, \ldots, y_{iT})$ given $(y_{i0}, \mathbf{z}_i)$
has exactly the form in equation (15.70), where $\mathbf{x}_{it} = (1, \mathbf{z}_{it}, y_{i,t-1}, y_{i0}, \mathbf{z}_i)$ and with $a$
and $\sigma_a$ replacing $c$ and $\sigma_c$, respectively. Conveniently, this finding means that we can
use standard RE probit software to estimate $\psi$, $\boldsymbol{\delta}$, $\rho$, $\xi_0$, $\boldsymbol{\xi}$, and $\sigma_a^2$: we simply expand
the list of explanatory variables to include $y_{i0}$ and $\mathbf{z}_i$ in each time period. (The
approach that treats $y_{i0}$ and $\mathbf{z}_i$ as fixed omits $y_{i0}$ and $\mathbf{z}_i$ in each time period.) It is
simple to test $H_0\colon \rho = 0$, which means there is no state dependence once we control
for an unobserved effect.
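
Although standard RE probit software handles the integration, it can be useful to see what one unit's likelihood contribution looks like once $a_i$ is integrated out. The sketch below is a minimal illustration using a trapezoid rule and the standard library; the function names, grid settings, and the convention that the caller supplies the index without $a_i$ for each period are all assumptions for the example, not a description of any particular package.

```python
import math


def norm_cdf(z):
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def norm_pdf(a, sigma):
    return math.exp(-0.5 * (a / sigma) ** 2) / (sigma * math.sqrt(2.0 * math.pi))


def lik_i(y, index_no_a, sigma_a, n_grid=2001, width=8.0):
    """Density of (y_1, ..., y_T) given (y_0, z_i) with a_i integrated out.

    y: observed outcomes y_1, ..., y_T for one unit.
    index_no_a: for each t, the value of
        psi + z_t*delta + rho*y_{t-1} + xi0*y_0 + z_i*xi
        evaluated at the observed lag (supplied by the caller).
    """
    lo = -width * sigma_a
    h = 2.0 * width * sigma_a / (n_grid - 1)
    total = 0.0
    for k in range(n_grid):
        a = lo + k * h
        prod = 1.0
        for y_t, idx in zip(y, index_no_a):
            p = norm_cdf(idx + a)
            prod *= p if y_t == 1 else 1.0 - p
        weight = 0.5 if k in (0, n_grid - 1) else 1.0
        total += weight * prod * norm_pdf(a, sigma_a)
    return total * h


# With T = 1 the integral has a closed form, Phi(index / sqrt(1 + sigma_a^2)).
print(lik_i([1], [0.8], 1.5), norm_cdf(0.8 / math.sqrt(1.0 + 1.5 ** 2)))
```

The $T = 1$ check also shows where the scale factor $(1 + \sigma_a^2)^{-1/2}$ comes from: averaging a probit response over normal heterogeneity yields a probit response with attenuated index.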

We can estimate APEs as in Chamberlain’s unobserved effects probit model with
strictly exogenous explanatory variables, but now we average out the initial condition
along with the leads and lags of all strictly exogenous variables. Let $\mathbf{z}_t$ and $y_{t-1}$ be
given values of the explanatory variables. Then the ASF, $E[\Phi(\mathbf{z}_t\boldsymbol{\delta} + \rho y_{t-1} + c_i)] =
E[\Phi(\psi_a + \mathbf{z}_t\boldsymbol{\delta}_a + \rho_a y_{t-1} + \xi_{a0}y_{i0} + \mathbf{z}_i\boldsymbol{\xi}_a)]$, is consistently estimated as
$\widehat{ASF}(\mathbf{z}_t, y_{t-1}) = N^{-1}\sum_{i=1}^{N} \Phi(\hat{\psi}_a + \mathbf{z}_t\hat{\boldsymbol{\delta}}_a + \hat{\rho}_a y_{t-1} + \hat{\xi}_{a0}y_{i0} + \mathbf{z}_i\hat{\boldsymbol{\xi}}_a)$, where the $a$ subscript denotes that
the original coefficients have been multiplied by $(1 + \hat{\sigma}_a^2)^{-1/2}$ and $\hat{\psi}$, $\hat{\boldsymbol{\delta}}$, $\hat{\rho}$, $\hat{\xi}_0$, $\hat{\boldsymbol{\xi}}$, and
$\hat{\sigma}_a^2$ are the CMLEs that would be reported directly by any econometrics package that
estimates RE probit models. We can take derivatives of this expression with respect
to continuous elements of $\mathbf{z}_t$, or take differences with respect to discrete elements. Of
particular interest is to alternatively set $y_{t-1} = 1$ and $y_{t-1} = 0$ and obtain the change
in the probability that $y_t = 1$ when $y_{t-1}$ goes from zero to one. Probably we would
average across $\mathbf{z}_{it}$, too. To obtain a single APE, we can also average across all time
periods. The key is that we average across $(y_{i0}, \mathbf{z}_i)$ to estimate the ASF, and then
from there we decide which partial effects are of interest. Wooldridge (2005b)
contains further discussion, and we compute the APE for the lagged dependent variable
in the example below.
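
The APE for state dependence described here is a difference of two averaged probit probabilities. A minimal sketch follows, assuming the unscaled fitted index without the lagged outcome has already been computed for each unit (or unit-period); the function name and argument layout are hypothetical.

```python
import math


def norm_cdf(z):
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def ape_state_dependence(base_index, rho_hat, sigma2_a_hat):
    """Average change in P(y_t = 1) when y_{t-1} goes from zero to one.

    base_index: unscaled values of psi + z_t*delta + xi0*y_i0 + z_i*xi,
        one entry per observation being averaged over.
    rho_hat, sigma2_a_hat: RE probit estimates of rho and sigma_a^2.
    """
    s = (1.0 + sigma2_a_hat) ** -0.5  # scale applied to every coefficient
    diffs = [norm_cdf(s * (b + rho_hat)) - norm_cdf(s * b) for b in base_index]
    return sum(diffs) / len(diffs)


# Made-up index values, used only to show the call.
print(ape_state_dependence([-0.5, 0.0, 0.4], rho_hat=1.0, sigma2_a_hat=1.0))
```

Standard errors for such an APE would typically come from a panel bootstrap that resamples units and repeats both the estimation and the averaging.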

In estimating the dynamic model just described, it is important to understand that
a simpler, pooled estimation method does not consistently estimate the scaled
parameters or the APEs. In other words, we cannot simply use pooled probit of $y_{it}$
on $1$, $\mathbf{z}_{it}$, $y_{i,t-1}$, $y_{i0}$, $\mathbf{z}_i$. The problem is that, while $P(y_{it} = 1 \mid \mathbf{z}_i, y_{i,t-1}, \ldots, y_{i0}, a_i) =
\Phi(\psi + \mathbf{z}_{it}\boldsymbol{\delta} + \rho y_{i,t-1} + \xi_0 y_{i0} + \mathbf{z}_i\boldsymbol{\xi} + a_i)$, it is not true that $P(y_{it} = 1 \mid \mathbf{z}_i, y_{i,t-1}, \ldots, y_{i0})
= \Phi(\psi_a + \mathbf{z}_{it}\boldsymbol{\delta}_a + \rho_a y_{i,t-1} + \xi_{a0}y_{i0} + \mathbf{z}_i\boldsymbol{\xi}_a)$ unless $a_i$ is identically zero, which means
that $c_i$ is a deterministic linear function of $(y_{i0}, \mathbf{z}_i)$. Correlation between $y_{i,t-1}$ and $a_i$
means that $P(y_{it} = 1 \mid \mathbf{z}_i, y_{i,t-1}, \ldots, y_{i0})$ does not follow a probit model with index
that depends on the scaled coefficients of interest. Therefore, we should use the RE
probit approach, and not pooled probit (as in Section 15.8.1). For comparison
purposes, we can estimate the dynamic model without the unobserved effect, $c_i$, and then
pooled probit is the appropriate estimation method.


Example 15.6 (Dynamic Women’s LFP Equation): We now use the data in
LFP.RAW to estimate a model for $P(lfp_{it} = 1 \mid kids_{it}, lhinc_{it}, lfp_{i,t-1}, c_i)$, where one
lag of labor force participation is assumed to suffice for the dynamics and
$\{(kids_{it}, lhinc_{it}) : t = 1, \ldots, T\}$ is assumed to be strictly exogenous conditional on $c_i$.
Also, we include the time-constant variables educ, black, age, and age² and a full set
of time-period dummies. (We start with five periods and lose one with the lag.
Therefore, we estimate the model using four periods of data.) We include among the
regressors the initial value, $lfp_{i0}$, $kids_{i1}$ through $kids_{i4}$, and $lhinc_{i1}$ through $lhinc_{i4}$. (To
keep the notation consistent with the previous development, we implicitly relabel the
time periods in the data set.) Estimating the model by RE probit gives $\hat{\rho} = 1.541$
(se = .067), and so, even after controlling for unobserved heterogeneity, there is
strong evidence of state dependence. But to obtain the size of the effect, we compute
the APE for $lfp_{t-1}$. The calculation involves averaging $\Phi(\hat{\psi}_a + \mathbf{z}_{it}\hat{\boldsymbol{\delta}}_a + \hat{\rho}_a + \hat{\xi}_{a0}y_{i0} +
\mathbf{z}_i\hat{\boldsymbol{\xi}}_a) - \Phi(\hat{\psi}_a + \mathbf{z}_{it}\hat{\boldsymbol{\delta}}_a + \hat{\xi}_{a0}y_{i0} + \mathbf{z}_i\hat{\boldsymbol{\xi}}_a)$ across all $t$ and $i$; we must be sure to scale the
original coefficients by $(1 + \hat{\sigma}_a^2)^{-1/2}$ where, in this application, $\hat{\sigma}_a^2 = 1.103$. The APE
estimated from this method is about .260 (panel bootstrap standard error is .026 with
500 replications). In other words, averaged across all women and all time periods, the
probability of being in the labor force at time $t$ is about .26 higher if the woman was
in the labor force at time $t - 1$ than if she was not. This estimate controls for
unobserved heterogeneity, number of young children, husband’s income, and the woman’s
education, race, and age.
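
The scaling step in this example is a one-line calculation from the two reported estimates. The sketch below only reproduces that arithmetic; it does not recover the APE of .260, which also requires the remaining fitted coefficients and the data.

```python
sigma2_a_hat = 1.103  # reported estimate of sigma_a^2
rho_hat = 1.541       # reported RE probit coefficient on lagged lfp

scale = (1.0 + sigma2_a_hat) ** -0.5
rho_a_hat = scale * rho_hat

# Scale factor and scaled coefficient on lagged participation.
print(round(scale, 3), round(rho_a_hat, 3))
```

The scale factor is roughly .69, so the coefficient that enters the averaged probit index is roughly 1.06 rather than 1.541.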

It is instructive to compare the APE with the estimate of a dynamic probit model
that ignores $c_i$. In this case, we just use pooled probit of $lfp_{it}$ on $1$, $kids_{it}$, $lhinc_{it}$,
$lfp_{i,t-1}$, $educ_i$, $black_i$, $age_i$, and $age_i^2$ and include a full set of period dummies. The
coefficient on $lfp_{i,t-1}$ is 2.876 (se = .027), which is much higher than in the dynamic
RE probit model. More important, the APE for state dependence is about .837
(panel bootstrap se = .005), which is much higher than when heterogeneity is
controlled for. Therefore, in this example, much of the persistence in labor force
participation of married women is accounted for by the unobserved heterogeneity. There is
still some state dependence, but its value is much smaller than a simple dynamic
probit indicates.

As mentioned earlier, Wooldridge (2000) proposes an extension of the preceding
approach to allow other not strictly exogenous explanatory variables to appear in
unobserved effects probit models. Generally, if $w_{it}$ is a variable that is sequentially
but not strictly exogenous, we model $P(y_{it} = 1 \mid w_{it}, \mathbf{z}_{it}, y_{i,t-1}, c_{i1})$ and assume this is
the same probability when we include all lags of $w_{it}$, further lags of $y_{it}$, and the entire
history of the strictly exogenous variables, $\mathbf{z}_{it}$. (Allowing a lagged value of $w_{it}$ only
changes the notation.) We then add a model for the density $D(w_{it} \mid \mathbf{z}_{it}, w_{i,t-1}, y_{i,t-1},
c_{i2})$, assuming that one lag each of $w_{it}$ and $y_{it}$ is sufficient to capture the dynamics,
and that $\mathbf{z}_{it}$ is also strictly exogenous in this distribution. The joint density of $(y_{it}, w_{it})$
given $(\mathbf{z}_i, w_{i,t-1}, \ldots, w_{i0}, y_{i,t-1}, \ldots, y_{i0}, \mathbf{c}_i)$ is $f_1(y_t \mid w_t, \mathbf{z}_{it}, y_{i,t-1}, c_{i1}) \cdot f_2(w_t \mid \mathbf{z}_{it},
w_{i,t-1}, y_{i,t-1}, c_{i2}) = g_t(y_t, w_t \mid \mathbf{z}_i, w_{i,t-1}, \ldots, w_{i0}, y_{i,t-1}, \ldots, y_{i0}, c_{i1}, c_{i2})$. By multiplying
these densities from $t = 1$ to $T$ we obtain the density of $(y_{i1}, w_{i1}, \ldots, y_{iT}, w_{iT})$ given
$(\mathbf{z}_i, y_{i0}, w_{i0}, \mathbf{c}_i)$. Now, as before, if we specify a density for $D(\mathbf{c}_i \mid \mathbf{z}_i, y_{i0}, w_{i0})$, we can
obtain a density for $D(\mathbf{y}_i, \mathbf{w}_i \mid \mathbf{z}_i, y_{i0}, w_{i0})$ by integrating out $\mathbf{c}_i$. We can construct a
log-likelihood function for estimating the parameters in both conditional densities, as
well as the parameters in the heterogeneity distribution. The problem is significantly
harder if we truly allow two sources of heterogeneity in $\mathbf{c}_i = (c_{i1}, c_{i2})$, especially if we
allow them to be correlated, but conditioning on the initial conditions $(y_{i0}, w_{i0})$ helps
simplify the estimation. Wooldridge (2000) provides more details when $w_{it}$ is a binary
response that may react to changes in past values of $y_{it}$, thereby causing it to violate
the strict exogeneity assumption.


15.8.5  Probit Models with Heterogeneity and Endogenous Explanatory Variables 


The previous methods assumed that the explanatory variables satisfy a strict exoge- 
neity assumption or a sequential exogeneity assumption. We can also use probit 
models to account for unobserved heterogeneity and contemporaneous endogeneity. 
We focus on two cases: (1) the endogenous explanatory variable has a conditional 
normal distribution and (2) the endogenous explanatory variable follows a reduced 
form probit. The reasons for these restrictions will become clear, as we must adapt 
the methods from Section 15.7. 
We can write the model as 


Vin = lz + Vin + ca + uin = OI, Uin | Zi, ca ~ Normal(0, 1), (15.86) 


where y; is the binary response, y,,. is the endogenous explanatory variable, c; is 
the unobserved heterogeneity, {uj : t= 1,...,T} is the sequence of idiosyncratic 
errors, and Z; = (zj1,...,Zir) is the sequence of strictly exogenous variables (condi- 
tional on cj). Notice that u; is assumed to be independent of (z;, ca). We would 
want to include a full set of period dummies in Zy. 
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Our focus will be on estimating APEs. Regardless of the nature of yp, these are 
easily obtained from the ASF. At time ¢, 


ASF(2n, y2) = Ee, [®(Z161 + aye + ca)), (15.87) 


where $E_{c_{i1}}(\cdot)$ denotes the expected value with respect to the distribution of $c_{i1}$. As we
did previously, we specify a model for the conditional distribution $D(c_{i1} \mid z_i)$, the
distribution conditional only on the strictly exogenous explanatory variables. Not
surprisingly, a normal distribution is convenient:


$$c_{i1} = \psi_1 + \bar{z}_i\xi_1 + a_{i1}, \qquad a_{i1} \mid z_i \sim \mathrm{Normal}(0, \sigma_{a1}^2), \tag{15.88}$$


where $\bar{z}_i$ contains the time averages of all strictly exogenous variables (except any
aggregate time effects, such as period dummies). In particular, if at time $t$ we have
$z_{it} = (z_{it1}, z_{it2})$, then the time averages of $z_{it2}$ are also in equation (15.88). Plugging
into (15.86) and doing simple algebra gives


$$\begin{aligned} y_{it1} &= 1[z_{it1}\delta_1 + \alpha_1 y_{it2} + \psi_1 + \bar{z}_i\xi_1 + a_{i1} + u_{it1} \ge 0] \\ &= 1[z_{it1}\delta_{a1} + \alpha_{a1} y_{it2} + \psi_{a1} + \bar{z}_i\xi_{a1} + e_{it1} \ge 0], \end{aligned} \tag{15.89}$$


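where $e_{it1} = (a_{i1} + u_{it1})/(1 + \sigma_{a1}^2)^{1/2}$ has a standard normal distribution conditional
on $z_i$. On the parameters, the “a” subscript denotes division by $(1 + \sigma_{a1}^2)^{1/2}$; for example,
$\delta_{a1} = \delta_1/(1 + \sigma_{a1}^2)^{1/2}$. Now the average structural function can be obtained as


$$\begin{aligned} &E_{(\bar{z}_i, e_{it1})}\{1[z_{t1}\delta_{a1} + \alpha_{a1} y_{t2} + \psi_{a1} + \bar{z}_i\xi_{a1} + e_{it1} \ge 0]\} \\ &\qquad = E_{\bar{z}_i}[\Phi(z_{t1}\delta_{a1} + \alpha_{a1} y_{t2} + \psi_{a1} + \bar{z}_i\xi_{a1})], \end{aligned}$$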


where the equality follows by iterated expectations. It follows that we can consistently
estimate the APEs by averaging out $\bar{z}_i$ if we can estimate the scaled coefficients.
But these scaled coefficients are precisely what we estimate if we apply one of
the MLE methods from Section 15.7.2 or 15.7.3, depending on whether $y_{it2}$ has a
conditional normal distribution or follows a probit. In the former case, we can write
a reduced form as


$$y_{it2} = z_{it}\delta_2 + \psi_2 + \bar{z}_i\xi_2 + v_{it2}, \qquad v_{it2} \mid z_i \sim \mathrm{Normal}(0, \tau_2^2), \quad t = 1, \ldots, T. \tag{15.90}$$


Notice that the instruments for $y_{it2}$ omitted from the estimating equation (15.89) are
$z_{it2}$; all elements of $\bar{z}_i$ are in equations (15.89) and the reduced form (15.90). Then, we
use pooled MLE directly on equations (15.89) and (15.90), and then compute the
APEs exactly as we did before, possibly computing them for each time period. (We
might call this approach pooled IV probit.) Bootstrapping is a simple (though
computationally expensive) way to obtain proper standard errors. If $y_{it2}$ is binary, we
replace the right-hand side of (15.90) with $1[z_{it}\delta_2 + \psi_2 + \bar{z}_i\xi_2 + v_{it2} \ge 0]$, set $\tau_2 = 1$,
and apply pooled bivariate probit.
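
The averaging step that turns scaled coefficients into an ASF or APE can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names, the coefficient values, and the time averages below are hypothetical, and the scaled coefficients are taken as already estimated by one of the pooled MLE methods.

```python
import math

def norm_cdf(v):
    """Standard normal cdf, Phi(v)."""
    return 0.5 * (1.0 + math.erf(v / math.sqrt(2.0)))

def norm_pdf(v):
    """Standard normal density, phi(v)."""
    return math.exp(-0.5 * v * v) / math.sqrt(2.0 * math.pi)

def scaled_index(z_t1, y_t2, zbar_i, coefs):
    """z_t1*delta_a1 + alpha_a1*y_t2 + psi_a1 + zbar_i*xi_a1 for one unit."""
    delta_a1, alpha_a1, psi_a1, xi_a1 = coefs
    return (sum(z * d for z, d in zip(z_t1, delta_a1)) + alpha_a1 * y_t2
            + psi_a1 + sum(zb * x for zb, x in zip(zbar_i, xi_a1)))

def asf(z_t1, y_t2, zbars, coefs):
    """Estimated ASF: average Phi(index) over the observed time averages zbar_i."""
    vals = [norm_cdf(scaled_index(z_t1, y_t2, zb, coefs)) for zb in zbars]
    return sum(vals) / len(vals)

def ape_y2(z_t1, y_t2, zbars, coefs):
    """APE of a continuous y_t2: alpha_a1 times the average of phi(index)."""
    alpha_a1 = coefs[1]
    vals = [norm_pdf(scaled_index(z_t1, y_t2, zb, coefs)) for zb in zbars]
    return alpha_a1 * sum(vals) / len(vals)

# Hypothetical scaled estimates (delta_a1, alpha_a1, psi_a1, xi_a1) and time averages.
coefs = ([0.4, -0.2], 0.5, -0.3, [0.1, 0.3])
zbars = [[0.2, 1.0], [-0.5, 0.4], [1.1, -0.7]]
print(asf([1.0, 0.5], 0.8, zbars, coefs), ape_y2([1.0, 0.5], 0.8, zbars, coefs))
```

For a discrete change in $y_{t2}$, one would difference `asf` at the two values rather than use the derivative.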

Pooled estimation is very convenient because any routine that allows estimation of
the models for cross section data can be used for panel data, provided robust standard
errors and test statistics are computed to account for the neglected time dependence.
A full MLE method, which would account for the presence of $a_{i1}$ and assume,
in addition, that the $u_{it1}$ are serially independent, would be much more computationally
intensive and would not be robust to serial correlation in the idiosyncratic
errors. More importantly, MLE jointly across time would require specification of the
joint distribution of $\{(e_{it1}, v_{it2}) : t = 1, \ldots, T\}$, and not just the bivariate distribution
of $(e_{it1}, v_{it2})$ for each $t$. Assuming independence across $t$ would be unrealistic, and
allowing for realistic correlation over time would be computationally expensive.

When $y_{it2}$ is continuous, a control function approach is also available. Using
manipulations similar to those above and in Section 15.7.2, we can write


$$y_{it1} = 1[z_{it1}\delta_{g1} + \alpha_{g1} y_{it2} + \theta_{g1} v_{it2} + \psi_{g1} + \bar{z}_i\xi_{g1} + r_{it1} \ge 0],$$


where the coefficients have been scaled by a different variance and so we index the
parameters by “g” just to distinguish them from the previous scaled parameters.
Now, $r_{it1}$ is independent of $(z_i, y_{it2}, v_{it2})$ with a standard normal distribution. After
obtaining the $\hat{v}_{it2}$ as the residuals from pooled OLS estimation of the reduced form,
we use pooled probit of $y_{it1}$ on $z_{it1}$, $y_{it2}$, $\hat{v}_{it2}$, 1, and $\bar{z}_i$. A simple test of the null
hypothesis that $y_{it2}$ is contemporaneously exogenous is obtained by a $t$ test of
$H_0 : \theta_{g1} = 0$, which is directly available from the pooled probit provided we make the
statistic robust to arbitrary serial dependence.


15.8.6 Semiparametric Approaches 


Under strict exogeneity of the explanatory variables, it is possible to consistently
estimate $\beta$ up to scale under very weak assumptions. Manski (1987) derives an
objective function that identifies $\beta$ up to scale in the $T = 2$ case when $e_{i1}$ and $e_{i2}$ in
the model (15.72) are identically distributed conditional on $(x_{i1}, x_{i2}, c_i)$ and $x_{it}$ is
strictly exogenous. The estimator is the maximum score estimator applied to the differences
$\Delta y_i$ and $\Delta x_i$. As in the cross-sectional case, it is not known how to estimate
the average response probabilities.

Honoré and Kyriazidou (2000a) show how to estimate the parameters in the
unobserved effects logit model with a lagged dependent variable and strictly exogenous
explanatory variables without making distributional assumptions about the
unobserved effect. Unfortunately, the estimators, which are consistent and asymptotically
normal, do not generally converge at the usual $\sqrt{N}$ rate. In addition, as with
many semiparametric approaches, discrete explanatory variables such as time dummies
are ruled out, and it is not possible to estimate the APEs. See also Arellano and
Honoré (2001).

A middle ground between parametrically specifying $D(c_i \mid x_i)$ and allowing it to be
completely unrestricted is to impose substantive assumptions on $D(c_i \mid x_i)$ but without
making parametric assumptions. As a special case of Altonji and Matzkin’s (2005)
“exchangeability” assumption, we might impose the restriction


$$D(c_i \mid x_i) = D(c_i \mid \bar{x}_i) \tag{15.91}$$


without specifying $D(c_i \mid \bar{x}_i)$. Why is this restriction useful? Consider a general
specification where the only restriction is strict exogeneity conditional on the
heterogeneity:


$$P(y_{it} = 1 \mid x_i, c_i) = P(y_{it} = 1 \mid x_{it}, c_i) \equiv G_t(x_{it}, c_i), \tag{15.92}$$


where $G_t$ is an unknown function taking values in (0,1) and we let $c_i$ be a vector of
unobserved heterogeneity to emphasize the generality of the setup. We allow for a $t$
subscript on $G$ as a general way of allowing aggregate time effects (analogous to
including different period dummies in a linear model, probit, or logit). The ASF at
time $t$ can be written as


$$\mathrm{ASF}_t(x_t) = E_{c_i}[G_t(x_t, c_i)] = E_{\bar{x}_i}\{E[G_t(x_t, c_i) \mid \bar{x}_i]\} \equiv E_{\bar{x}_i}[R_t(x_t, \bar{x}_i)], \tag{15.93}$$


where $R_t(x_t, \bar{x}_i) \equiv E[G_t(x_t, c_i) \mid \bar{x}_i]$. It follows that, given an estimator $\hat{R}_t(\cdot, \cdot)$ of the
function $R_t(\cdot, \cdot)$, the ASF can be estimated as $N^{-1} \sum_{i=1}^{N} \hat{R}_t(x_t, \bar{x}_i)$, and then we can
take derivatives or changes with respect to the entries in $x_t$.

How can we estimate $R_t(\cdot, \cdot)$? This is where assumption (15.91) comes into play. If
we combine (15.91) and (15.92) we have


$$\begin{aligned} E(y_{it} \mid x_i) &= E[E(y_{it} \mid x_i, c_i) \mid x_i] = E[G_t(x_{it}, c_i) \mid x_i] = \int G_t(x_{it}, c)\, dF(c \mid x_i) \\ &= \int G_t(x_{it}, c)\, dF(c \mid \bar{x}_i) = R_t(x_{it}, \bar{x}_i), \end{aligned}$$


where $F(c \mid x_i)$ denotes the cdf of $D(c_i \mid x_i)$ (which can be a discrete, continuous, or
mixed distribution), the second equality follows from (15.92), the fourth equality
follows from assumption (15.91), and the last equality follows from the definition
of $R_t(\cdot, \cdot)$. Of course, because $E(y_{it} \mid x_i)$ depends only on $(x_{it}, \bar{x}_i)$, we must have
$E(y_{it} \mid x_{it}, \bar{x}_i) = R_t(x_{it}, \bar{x}_i)$. Further, $\{x_{it} : t = 1, \ldots, T\}$ is assumed to have time variation,
and so $x_{it}$ and $\bar{x}_i$ can be used as separate regressors even in a fully nonparametric
setting.

We do not generally treat nonparametric methods in this text, but the preceding
discussion suggests some simple yet flexible parametric approaches. The key is that,
under specification (15.92) with assumption (15.91), we can specify flexible binary
models for $P(y_{it} = 1 \mid x_{it}, \bar{x}_i)$, estimate these models using MLE methods, and then
average out $\bar{x}_i$ to obtain estimated APEs. Without a very large $N$ or with many elements
of $x_{it}$, we probably would economize by assuming that at least some parameters
are constant across $t$. For example, a flexible probit model is


$$P(y_{it} = 1 \mid x_{it}, \bar{x}_i) = \Phi[\theta_t + x_{it}\beta + \bar{x}_i\gamma + (\bar{x}_i \otimes \bar{x}_i)\delta + (x_{it} \otimes \bar{x}_i)\eta], \quad t = 1, \ldots, T, \tag{15.94}$$


where the Kronecker products simply mean we include all squares and nonredundant
interactions among $\bar{x}_i$ and interactions among $x_{it}$ and $\bar{x}_i$. Aggregate time effects are
allowed through the $\theta_t$, and we can estimate this model by pooled probit, GEE, or
minimum distance methods. Using a logit instead of a probit would likely change
very little in terms of estimated partial effects. With large $N$, one might use a cdf that
depends on extra parameters. The point here is that, under (15.91) and (15.92), the
focus on APEs liberates one from having to specify specific functional forms in
equation (15.92). Because we can only identify APEs anyway, rather than subject-specific
effects, we might as well start from models for $P(y_{it} = 1 \mid x_{it}, \bar{x}_i)$.
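
As a rough illustration of what the Kronecker-product terms in the flexible probit amount to in practice, the following Python sketch builds the regressor list for one observation. The function name and the ordering of terms are hypothetical choices made here; the period intercepts $\theta_t$ would be added separately as time dummies.

```python
from itertools import combinations_with_replacement, product

def flexible_terms(x_it, xbar_i):
    """Regressors for one (i, t): levels of x_it and xbar_i, all squares and
    nonredundant interactions among xbar_i, and all x_it-by-xbar_i interactions."""
    levels = list(x_it) + list(xbar_i)
    xbar_sq = [a * b for a, b in combinations_with_replacement(xbar_i, 2)]
    cross = [a * b for a, b in product(x_it, xbar_i)]
    return levels + xbar_sq + cross

# Hypothetical values with K = 2 covariates.
print(flexible_terms([1.0, 2.0], [3.0, 4.0]))
```

With $K$ covariates this produces $2K + K(K+1)/2 + K^2$ terms, which makes clear why one might restrict some parameters to be constant across $t$ unless $N$ is large.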

We can go further. For example, suppose that we think the heterogeneity $c_i$ is
correlated with features of the covariates other than just the time average. Altonji
and Matzkin (2005) allow for $\bar{x}_i$ in equation (15.91) to be replaced by other functions
$w_i$ of $\{x_{it} : t = 1, \ldots, T\}$, such as sample variances and covariances. These are examples
of “exchangeable” functions of $\{x_{it} : t = 1, \ldots, T\}$, that is, statistics whose
value is the same regardless of the ordering of the $x_{it}$. Nonexchangeable functions can
be used, too. For example, we might think that $c_i$ is correlated with individual-specific
trends, and so we obtain $w_i$ as the intercept and slope from the unit-specific
regressions $x_{it}$ on 1, $t$, $t = 1, \ldots, T$ (for $T > 3$); we can also add the error variance
from this individual specific regression if we have sufficient time periods.
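
The unit-specific trend regressions just described are simple enough to sketch directly. The Python function below is a minimal illustration for a single scalar covariate (with a vector $x_{it}$ one would apply it element by element); the function name and the degrees-of-freedom adjustment $T - 2$ for the error variance are choices made for this sketch.

```python
def unit_trend(x_series):
    """OLS of x_it on (1, t), t = 1, ..., T, for one unit.
    Returns (intercept, slope, error_variance); the variance is None when T <= 2."""
    T = len(x_series)
    ts = list(range(1, T + 1))
    t_bar = sum(ts) / T
    x_bar = sum(x_series) / T
    s_tt = sum((t - t_bar) ** 2 for t in ts)
    slope = sum((t - t_bar) * (x - x_bar) for t, x in zip(ts, x_series)) / s_tt
    intercept = x_bar - slope * t_bar
    resid = [x - intercept - slope * t for t, x in zip(ts, x_series)]
    err_var = sum(r * r for r in resid) / (T - 2) if T > 2 else None
    return intercept, slope, err_var

# Hypothetical unit whose covariate follows an exact linear trend.
print(unit_trend([3.0, 5.0, 7.0, 9.0]))
```

The resulting intercept and slope (and, if desired, the error variance) would then enter $w_i$ in place of, or in addition to, the time average.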

Altonji and Matzkin (2005) focus on what they call the local average response
(LAR) as opposed to the APE. In specification (15.92), the LAR at $x_t$ for a continuous
variable $x_{tj}$ is


$$\int \frac{\partial G_t(x_t, c)}{\partial x_{tj}}\, dH_t(c \mid x_t), \tag{15.95}$$


where $H_t(c \mid x_t)$ denotes the cdf of $D(c_i \mid x_{it} = x_t)$. This is a “local” partial effect because
it averages out the heterogeneity for the slice of the population given by the
vector $x_t$. The APE, which by comparison could be called a “global average response,”
averages out over the entire distribution of $c_i$.

When $D(c_i \mid x_i) = D(c_i \mid w_i)$ for $w_i$ a function of $x_i$, Altonji and Matzkin show that
the LAR can be obtained as


$$\int \frac{\partial R_t(x_t, w)}{\partial x_{tj}}\, dK_t(w \mid x_t), \tag{15.96}$$

where $R_t(x_t, w) = E(y_{it} \mid x_{it} = x_t, w_i = w)$ and $K_t(w \mid x_t)$ is the cdf of $D(w_i \mid x_{it} = x_t)$.
Altonji and Matzkin demonstrate how to estimate the LAR based on nonparametric
estimation of $E(y_{it} \mid x_{it}, w_i)$ followed by “local” averaging, that is, averaging
$\partial \hat{E}(y_{it} \mid x_{it} = x_t, w_i)/\partial x_{tj}$ over observations $i$ with $x_{it}$ “close” to $x_t$. In the binary response
context, the expected value is simply $P(y_{it} = 1 \mid x_{it}, w_i)$. Of course, the LAR
can even be estimated using parametric models such as that in equation (15.94). For
a continuous $x_{tj}$, we would simply average the derivative of
$\Phi[\theta_t + x_t\beta + \bar{x}_i\gamma + (\bar{x}_i \otimes \bar{x}_i)\delta + (x_t \otimes \bar{x}_i)\eta]$ with respect to $x_{tj}$ over $i$ with $x_{it}$ “close” to $x_t$. Because
defining the appropriate notion of “closeness” requires some care, the LAR is more
difficult to estimate than the APE, but LAR is perhaps more relevant because it is the
average response to an exogenous change in $x_t$ for units already starting from $x_t$. See
Altonji and Matzkin (2005) for further discussion concerning identification and estimation
of LARs.

An interesting possibility is to combine the Blundell and Powell (2004) approach to 
endogeneity with the Altonji and Matzkin (2005) approach to unobserved heteroge- 
neity, resulting in semiparametric versions of the methods in Section 15.8.5. 


Problems 


15.1. Suppose that y is a binary outcome and d1,d2,...,dM are dummy variables 
for exhaustive and mutually exclusive categories; that is, each person in the popula- 
tion falls into one and only one category. 


a. Show that the fitted values from the regression (without an intercept)

$y_i$ on $d1_i, d2_i, \ldots, dM_i$, $i = 1, 2, \ldots, N$

are always in the unit interval. In particular, carefully describe the coefficient on each
dummy variable and the fitted value for each $i$.

b. What happens if $y_i$ is regressed on $M$ linearly independent, linear combinations of
$d1_i, \ldots, dM_i$, for example, 1, $d2_i, d3_i, \ldots, dM_i$?


15.2. Suppose that family $i$ chooses annual consumption $c_i$ (in dollars) and charitable
contributions $q_i$ (in dollars) to solve the problem


$$\max_{c, q}\; c + a_i \log(1 + q)$$

subject to $c + p_i q \le m_i$, $c, q \ge 0$,


where $m_i$ is income of family $i$, $p_i$ is the price of one dollar of charitable contributions
(where $p_i < 1$ because of the tax deductibility of charitable contributions, and this
price differs across families because of different marginal tax rates and different state
tax codes), and $a_i > 0$ determines the marginal utility of charitable contributions.
Take $m_i$ and $p_i$ as exogenous to the family in this problem.

a. Show that the optimal solution is $q_i = 0$ if $a_i \le p_i$ and $q_i = a_i/p_i - 1$ if $a_i > p_i$.


b. Define $y_i = 1$ if $q_i > 0$ and $y_i = 0$ if $q_i = 0$, and suppose that $a_i = \exp(z_i\gamma + v_i)$,
where $z_i$ is a $J$-vector of observable family traits and $v_i$ is unobservable. Assume that
$v_i$ is independent of $(z_i, m_i, p_i)$ and $v_i/\sigma$ has symmetric distribution function $G(\cdot)$,
where $\sigma^2 = \mathrm{Var}(v_i)$. Show that


$$P(y_i = 1 \mid z_i, m_i, p_i) = G[(z_i\gamma - \log p_i)/\sigma]$$

so that $y_i$ follows an index model.


15.3. Let $z_1$ be a vector of variables, let $z_2$ be a continuous variable, and let $d_1$ be a
dummy variable.


a. In the model

$$P(y = 1 \mid z_1, z_2) = \Phi(z_1\delta_1 + \gamma_1 z_2 + \gamma_2 z_2^2),$$

find the partial effect of $z_2$ on the response probability. How would you estimate this
partial effect?


b. In the model

$$P(y = 1 \mid z_1, z_2, d_1) = \Phi(z_1\delta_1 + \gamma_1 z_2 + \gamma_2 d_1 + \gamma_3 z_2 d_1),$$

find the partial effect of $z_2$. How would you measure the effect of $d_1$ on the response
probability? How would you estimate these effects?


c. Describe how you would obtain the standard errors of the estimated partial effects
from parts a and b.


15.4. Evaluate the following statement: “Estimation of a linear probability model is
more robust than probit or logit because the LPM does not assume homoskedasticity
or a distributional assumption.”


15.5. Consider the probit model

$$P(y = 1 \mid z, q) = \Phi(z_1\delta_1 + \gamma_1 z_2 q),$$

where $q$ is independent of $z$ and distributed as Normal(0, 1); the vector $z$ is observed
but the scalar $q$ is not.

a. Find the partial effect of $z_2$ on the response probability, namely,

$$\frac{\partial P(y = 1 \mid z, q)}{\partial z_2}.$$

b. Show that $P(y = 1 \mid z) = \Phi[z_1\delta_1/(1 + \gamma_1^2 z_2^2)^{1/2}]$.

c. Define $\rho_1 \equiv \gamma_1^2$. How would you test $H_0 : \rho_1 = 0$?

d. If you have reason to believe $\rho_1 > 0$, how would you estimate $\delta_1$ along with $\rho_1$?


15.6. Consider taking a large random sample of workers at a given point in time.
Let $sick_i = 1$ if person $i$ called in sick during the last 90 days, and zero otherwise. Let
$z_i$ be a vector of individual and employer characteristics. Let $cigs_i$ be the number of
cigarettes individual $i$ smokes per day (on average).


a. Explain the underlying experiment of interest when we want to examine the effects 
of cigarette smoking on workdays lost. 


b. Why might $cigs_i$ be correlated with unobservables affecting $sick_i$?


c. One way to write the model of interest is

$$P(sick = 1 \mid z, cigs, q_1) = \Phi(z_1\delta_1 + \gamma_1 cigs + q_1),$$

where $z_1$ is a subset of $z$ and $q_1$ is an unobservable variable that is possibly correlated
with $cigs$. What happens if $q_1$ is ignored and you estimate the probit of $sick$ on $z_1$,
$cigs$?

d. Can cigs have a conditional normal distribution in the population? Explain. 


e. Explain how to test whether cigs is exogenous. Does this test rely on cigs having a 
conditional normal distribution? 

f. Suppose that some of the workers live in states that recently implemented no-smoking
laws in the workplace. Does the presence of the new laws suggest a good IV
candidate for cigs?


15.7. Use the data in GROGGER.RAW for this question.


a. Define a binary variable, say arr86, equal to unity if a man was arrested at least
once during 1986, and zero otherwise. Estimate an LPM relating arr86 to pcnv, avgsen,
tottime, ptime86, inc86, black, hispan, and born60. Report the usual and heteroskedasticity-robust
standard errors. What is the estimated effect on the probability of
arrest if pcnv goes from .25 to .75?


b. Test the joint significance of avgsen and tottime, using a nonrobust and robust 
test. 


c. Now estimate the model by probit. At the average values of avgsen, tottime, inc86, 
and ptime86 in the sample, and with black = 1, hispan = 0, and born60 = 1, what is 
the estimated effect on the probability of arrest if pcnv goes from .25 to .75? Compare 
this result with the answer from part a. 

d. For the probit model estimated in part c, obtain the percent correctly predicted.
What is the percent correctly predicted when narr86 = 0? When narr86 = 1? What do
you make of these findings?

e. In the probit model, add the terms $pcnv^2$, $ptime86^2$, and $inc86^2$ to the model. Are
these individually or jointly significant? Describe the estimated relationship between
the probability of arrest and pcnv. In particular, at what point does the probability of
conviction have a negative effect on probability of arrest?


15.8. Use the data set BWGHT.RAW for this problem. 


a. Define a binary variable, smokes, equal to one if the woman smokes during pregnancy.
Estimate a probit model relating smokes to motheduc, white, and log(faminc). At
white = 0 and faminc evaluated at the average in the sample, what is the estimated
difference in the probability of smoking for a woman with 16 years of education and
one with 12 years of education?

b. Do you think faminc is exogenous in the smoking equation? What about motheduc?

c. Assume that motheduc and white are exogenous in the probit from part a. Also
assume that fatheduc is exogenous to this equation. Estimate the reduced form for
log(faminc) to see if fatheduc is partially correlated with log(faminc).


d. Test the null hypothesis that log(faminc) is exogenous in the probit from part a. 


15.9. Assume that the binary variable y follows an LPM.

a. Write down the log-likelihood function for observation i.

b. Why might MLE of the LPM be difficult?

c. Assuming that you can estimate the LPM by MLE, explain why it is valid, as a
model selection device, to compare the log likelihood from the LPM with that from
logit or probit.


15.10. Suppose you wish to use goodness-of-fit measures to compare the LPM with
a model such as logit or probit, after estimating the LPM by ordinary least squares.
The usual R-squared from OLS estimation measures the proportion of the variance
in $y$ that is explained by $P(y = 1 \mid x) = x\beta$.

a. Explain how to obtain a comparable R-squared measure for the general index
model $P(y = 1 \mid x) = G(x\beta)$.

b. Compute the R-squared measures using the data in GROGGER.RAW, where the
dependent variable is arr86 and the explanatory variables are pcnv, $pcnv^2$, avgsen,
tottime, ptime86, $ptime86^2$, inc86, $inc86^2$, black, hispan, and born60. Are the R-squareds
substantially different?


15.11. List assumptions under which the pooled probit estimator is a conditional
MLE based on the distribution of $y_i$ given $x_i$, where $y_i$ is the $T \times 1$ vector of binary
outcomes and $x_i$ is the vector of all explanatory variables across all $T$ time periods.


15.12. Find $P(y_{i1} = 1, y_{i2} = 0, y_{i3} = 0 \mid x_i, c_i, n_i = 1)$ in the fixed effects logit model
with $T = 3$.


15.13. Suppose that you have a control group, A, and a treatment group, B, and 
two periods of data. Between the two years, a new policy is implemented that affects 
group B; see Section 6.5. 

a. If your outcome variable is binary (for example, an employment indicator), and 
you have no covariates, how would you estimate the effect of the policy? 

b. If you have covariates, write down a probit model that allows you to estimate the 
effect of the policy change. Explain in detail how you would estimate this effect. 


c. How would you get an asymptotic 95 percent confidence interval for the estimate 
in part b? 


15.14. Consider the binary response model with interactions between the (continuous)
endogenous explanatory variable and the exogenous variables:


$$y_1 = 1[z_1\delta_1 + y_2 z_1\alpha_1 + u_1 > 0],$$


and assume that assumption (15.40) holds with $(u_1, v_2)$ jointly normal and independent
of $z$, with mean zero and $\mathrm{Var}(u_1) = 1$.


a. Find $P(y_1 = 1 \mid z)$ and conclude that it has the form of a particular heteroskedastic
probit model.

b. Consider the following two-step procedure. In the first step, regress $y_{i2}$ on $z_i$ and
obtain the fitted values, $\hat{y}_{i2}$. In the second step, estimate probit of $y_{i1}$ on $z_{i1}$, $\hat{y}_{i2} z_{i1}$.
Why does this not produce consistent estimators of $\delta_1$ and $\alpha_1$?


c. How would you consistently estimate $\delta_1$ and $\alpha_1$?


15.15. Consider the problem of obtaining standard errors for the coefficient and
APE estimators in two-step control function estimation of binary response models.
For simplicity, let $y_2$ be a continuous, univariate endogenous variable following the
reduced form $y_2 = z\delta_2 + v_2$, and let $y_1 = 1[x_1\beta_1 + u_1 > 0]$, where $x_1 = g_1(z_1, y_2)$ is any function
of the exogenous and endogenous variables. As usual, we assume that $z = (z_1, z_2)$,
where at least one element of $z_2$ has a nonzero coefficient in $\delta_2$. Assume that the standard
Rivers-Vuong assumptions hold, so that


$$P(y_1 = 1 \mid z, y_2) = \Phi(x_1\beta_{\rho 1} + \theta_{\rho 1} v_2),$$


where $\beta_{\rho 1}$ and $\theta_{\rho 1}$ are the scaled coefficients that appear in the APEs. For simplicity,
let $\gamma_{\rho 1} = (\beta_{\rho 1}', \theta_{\rho 1})'$, let $\hat{\gamma}_{\rho 1}$ be the second-step Rivers-Vuong estimator, and let $\hat{\delta}_2$ be the
first-stage OLS estimator.


a. Find Avar $\sqrt{N}(\hat{\gamma}_{\rho 1} - \gamma_{\rho 1})$ using the material in Section 12.4.2.


b. Show that the asymptotic variance that properly accounts for the first-step estimation
of $\delta_2$ is no smaller (in the matrix sense) than the asymptotic variance that
ignores the estimation error in $\hat{\delta}_2$.


c. Define the vector of estimated average partial effects (based on partial derivatives)
as


$$\hat{\tau}_1 = \left[N^{-1} \sum_{i=1}^{N} \phi(\hat{w}_{i1}\hat{\gamma}_{\rho 1})\right] \hat{\beta}_{\rho 1},$$


where $\hat{w}_{i1} = (x_{i1}, \hat{v}_{i2})$. Find Avar $\sqrt{N}(\hat{\tau}_1 - \tau_1)$. It may help to use Problem 12.17.


d. Show how to consistently estimate the asymptotic variance in part c.


15.16. Suppose that the binary variable $y$ follows the model

$$P(y = 1 \mid x) = 1 - [1 + \exp(x\beta)]^{-\alpha},$$


where $\alpha > 0$ is a parameter that can be estimated along with the vector $\beta$. The $1 \times K$
vector $x$ contains unity as its first element. This model is sometimes called the skewed
logit model. Note that $\alpha = 1$ corresponds to the usual logit model.

a. For a continuous variable $x_K$, find the partial effect of $x_K$ on $P(y = 1 \mid x)$.

b. Write down the log-likelihood function for a random draw $i$.


c. Use the data from MROZ.RAW to estimate the model, using the same explanatory
variables in Table 15.1. Using a standard $t$ test, can you reject $H_0 : \log(\alpha) = 0$?


d. Compute the likelihood ratio test of $H_0 : \alpha = 1$. How does this compare with the
test in part c?


e. Overall, would you say that the more complicated model is justified? What other 
statistics might you compute to support your answer? 


15.17. In Example 15.4, add samesex as an explanatory variable to the worked 
equation in bivariate probit, and include samesex in the morekids probit, too. 


a. What is the estimated APE with respect to morekids, and how does it compare 
with those in Table 15.2? 


b. Is samesex significant in the worked equation? 


15.18. Consider Chamberlain’s random effects probit model under assumptions
(15.65), (15.66), and (15.73), but replace assumption (15.67) with


$$c_i \mid x_i \sim \mathrm{Normal}[\psi + \bar{x}_i\xi,\ \sigma_a^2 \exp(\bar{x}_i\lambda)],$$


so that $c_i$ given $\bar{x}_i$ has exponential heteroskedasticity.

a. Find $P(y_{it} = 1 \mid x_i, a_i)$, where $a_i = c_i - E(c_i \mid x_i)$. Does this probability differ from
the probability under assumption (15.73)? Explain.

b. Derive the log-likelihood function by first finding the density of $(y_{i1}, \ldots, y_{iT})$
given $x_i$. Does it have similarities with the log-likelihood function under assumption
(15.73)?

c. Assuming you have estimated $\beta$, $\psi$, $\xi$, $\sigma_a^2$, and $\lambda$ by CMLE, how would you estimate
the average partial effects? {Hint: First show that $E[\Phi(x^{\circ}\beta + \psi + \bar{x}_i\xi + a_i) \mid x_i]
= \Phi(\{x^{\circ}\beta + \psi + \bar{x}_i\xi\}/\{1 + \sigma_a^2 \exp(\bar{x}_i\lambda)\}^{1/2})$, and then use the appropriate average
across $i$.}


15.19. Use the data in KEANE.RAW for this question, and restrict your attention
to black men who are in the sample all 11 years.

a. Use pooled probit to estimate the model $P(employ_{it} = 1 \mid employ_{i,t-1}) =
\Phi(\delta_0 + \rho\, employ_{i,t-1})$. What assumption is needed to ensure that the usual standard
errors and test statistics from pooled probit are asymptotically valid?

b. Estimate $P(employ_t = 1 \mid employ_{t-1} = 1)$ and $P(employ_t = 1 \mid employ_{t-1} = 0)$. Explain
how you would obtain standard errors of these estimates.

c. Add a full set of year dummies to the analysis in part a, and estimate the probabilities
in part b for 1987. Are there important differences with the estimates in part b?

d. Now estimate a dynamic unobserved effects model using the method described in
Section 15.8.4. In particular, add $employ_{i,81}$ as an additional explanatory variable,
and use random effects probit software. Use a full set of year dummies.

e. Is there evidence of state dependence, conditional on $c_i$? Explain.


f. Average the estimated probabilities across $employ_{i,81}$ to get the average partial
effect for 1987. Compare the estimates with the effects estimated in part c.


16 Multinomial and Ordered Response Models


16.1 Introduction 


In this chapter we consider discrete response models with more than two outcomes.
Most applications fall into one of two categories. The first is an unordered response,
sometimes called a nominal response, where the values attached to different outcomes
are arbitrary and have no effect on estimation, inference, or interpretation. Examples
of unordered responses include occupational choice, health plan choice, and transportation
mode for commuting to work. For example, if there are four health plans
to choose from, we might label these 0, 1, 2, and 3 (or 100, 200, 300, 400), and it
does not matter which plan we assign to which number (provided, of course, that we
use the same labels across all observations). In Section 16.2 we cover the general
class of multinomial response models, which can be used to analyze unordered
responses.

For an ordered response, the values we assign to each outcome are no longer arbi- 
trary, although the magnitudes (usually) are. An example of an ordered response is a 
credit rating, where there are seven possible ratings. We might assign each person a 
rating in the set {0,1,2,3,4,5,6}, where zero is the lowest rating and six is the high- 
est. The fact that five is a better rating than four conveys important information, but 
nothing is lost if we use another set of numbers to denote credit rating—provided 
they maintain the same ordering. We treat ordered response models in Section 16.3. 

In addition to covering standard methods devised for cross section data with 
exogenous explanatory variables, we discuss some simple strategies for handling 
roughly continuous endogenous explanatory variables as well as unobserved hetero- 
geneity in a panel data context. 


16.2 Multinomial Response Models 


16.2.1 Multinomial Logit 


The first model we cover applies when a unit’s response or choice depends on individual
characteristics of the unit but not on attributes of the choices. Thus, it makes
sense to define the model in terms of random variables representing the underlying
population. Let $y$ denote a random variable taking on the values $\{0, 1, \ldots, J\}$ for $J$ a
positive integer, and let $x$ denote a set of conditioning variables. For example, if $y$
denotes occupational choice, $x$ can contain things like education, age, gender, race,
and marital status. As usual, $(x_i, y_i)$ is a random draw from the population.

As in the binary response case, we are interested in how ceteris paribus changes in
the elements of $x$ affect the response probabilities, $P(y = j \mid x)$, $j = 0, 1, 2, \ldots, J$.
Since the probabilities must sum to unity, $P(y = 0 \mid x)$ is determined once we know
the probabilities for $j = 1, \ldots, J$.

Let $x$ be a $1 \times K$ vector with first-element unity. The multinomial logit (MNL)
model has response probabilities


$$P(y = j \mid x) = \exp(x\beta_j) \bigg/ \left[1 + \sum_{h=1}^{J} \exp(x\beta_h)\right], \quad j = 1, \ldots, J, \tag{16.1}$$


where $\beta_j$ is $K \times 1$, $j = 1, \ldots, J$. Because the response probabilities must sum to unity,


$$P(y = 0 \mid x) = 1 \bigg/ \left[1 + \sum_{h=1}^{J} \exp(x\beta_h)\right].$$


When $J = 1$, $\beta_1$ is the $K \times 1$ vector of unknown parameters, and we get the binary
logit model.
The partial effects for this model are complicated. For continuous $x_k$, we can write


$$\frac{\partial P(y = j \mid x)}{\partial x_k} = P(y = j \mid x)\left\{\beta_{jk} - \left[\sum_{h=1}^{J} \beta_{hk} \exp(x\beta_h)\right] \bigg/ g(x, \beta)\right\}, \tag{16.2}$$

where $\beta_{hk}$ is the $k$th element of $\beta_h$ and $g(x, \beta) = 1 + \sum_{h=1}^{J} \exp(x\beta_h)$. Equation (16.2)
shows that even the direction of the effect is not determined entirely by $\beta_{jk}$. A simpler
interpretation of $\beta_j$ is given by


$$p_j(x, \beta)/p_0(x, \beta) = \exp(x\beta_j), \quad j = 1, 2, \ldots, J, \tag{16.3}$$


where $p_j(x, \beta)$ denotes the response probability in equation (16.1). Thus the change
in $p_j(x, \beta)/p_0(x, \beta)$ is approximately $\beta_{jk} \exp(x\beta_j)\Delta x_k$ for roughly continuous $x_k$.
Equivalently, the log-odds ratio is linear in $x$: $\log[p_j(x, \beta)/p_0(x, \beta)] = x\beta_j$. This result
extends to general $j$ and $h$: $\log[p_j(x, \beta)/p_h(x, \beta)] = x(\beta_j - \beta_h)$.

Here is another useful fact about the multinomial logit model. Since
$P(y = j \text{ or } y = h \mid x) = p_j(x, \beta) + p_h(x, \beta)$,

$$P(y = j \mid y = j \text{ or } y = h, x) = p_j(x, \beta)/[p_j(x, \beta) + p_h(x, \beta)] = \Lambda[x(\beta_j - \beta_h)],$$


where $\Lambda(\cdot)$ is the logistic function. In other words, conditional on the choice being
either $j$ or $h$, the probability that the outcome is $j$ follows a standard logit model with
parameter vector $\beta_j - \beta_h$.
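
The MNL response probabilities, the partial effect formula, the log-odds result, and the conditional logit property can all be checked numerically. The Python sketch below is illustrative only; the function names and parameter values are hypothetical, chosen so that the partial effect of the second regressor on $P(y = 1 \mid x)$ is negative even though its coefficient in $\beta_1$ is positive.

```python
import math

def mnl_probs(x, betas):
    """Response probabilities (16.1); betas[j-1] holds beta_j and outcome 0 is the base."""
    expo = [math.exp(sum(xk * bk for xk, bk in zip(x, b))) for b in betas]
    g = 1.0 + sum(expo)
    return [1.0 / g] + [e / g for e in expo]

def mnl_partial_effect(x, betas, j, k):
    """Partial effect (16.2) of x[k] on P(y = j | x), for j = 1, ..., J."""
    p = mnl_probs(x, betas)
    # p[h] = exp(x beta_h)/g, so this is sum_h beta_hk exp(x beta_h) / g(x, beta).
    weighted = sum(b[k] * p[h] for h, b in enumerate(betas, start=1))
    return p[j] * (betas[j - 1][k] - weighted)

def logistic(v):
    """Logistic function Lambda(v)."""
    return 1.0 / (1.0 + math.exp(-v))

# Hypothetical parameters: x = (1, x2) and J = 2.
x = [1.0, 1.0]
betas = [[0.0, 1.0], [0.0, 3.0]]
p = mnl_probs(x, betas)
xb = [sum(xk * bk for xk, bk in zip(x, b)) for b in betas]

log_odds_1_vs_0 = math.log(p[1] / p[0])      # equals x beta_1
cond_1_given_1_or_2 = p[1] / (p[1] + p[2])   # equals Lambda[x(beta_1 - beta_2)]
effect = mnl_partial_effect(x, betas, 1, 1)  # negative despite a positive coefficient
```

The sign reversal arises because raising $x_2$ shifts probability toward outcome 2 faster than toward outcome 1.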

Since we have fully specified the density of $y$ given $x$, estimation of the MNL
model is best carried out by maximum likelihood. For each $i$ the conditional log
likelihood can be written as

$$\ell_i(\beta) = \sum_{j=0}^{J} 1[y_i = j] \log[p_j(x_i, \beta)],$$


where the indicator function selects out the appropriate response probability for
each observation $i$. As usual, we estimate $\beta$ by maximizing $\sum_{i=1}^{N} \ell_i(\beta)$. McFadden
(1974) has shown that the log-likelihood function is globally concave, and this fact
makes the maximization problem straightforward. The conditions needed to apply
Theorems 13.1 and 13.2 for consistency and asymptotic normality are broadly applicable;
see McFadden (1984).
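
A minimal Python sketch of the MNL log likelihood follows. It uses the equivalent form $\ell_i(\beta) = x_i\beta_{y_i} - \log[1 + \sum_h \exp(x_i\beta_h)]$ with $\beta_0$ normalized to zero; the function names are hypothetical, and in practice one would pass the negative of `mnl_loglik` to a numerical optimizer.

```python
import math

def mnl_loglik_i(y_i, x_i, betas):
    """l_i(beta) = sum_j 1[y_i = j] log p_j(x_i, beta), with beta_0 = 0 for the base outcome."""
    idx = [0.0] + [sum(xk * bk for xk, bk in zip(x_i, b)) for b in betas]
    log_g = math.log(sum(math.exp(v) for v in idx))  # log[1 + sum_h exp(x beta_h)]
    return idx[y_i] - log_g

def mnl_loglik(ys, xs, betas):
    """Sample log likelihood: the sum of l_i(beta) over observations."""
    return sum(mnl_loglik_i(y, x, betas) for y, x in zip(ys, xs))

# With all coefficients zero, each of the J + 1 = 3 outcomes has probability 1/3.
print(mnl_loglik([0, 1, 2], [[1.0, 0.0]] * 3, [[0.0, 0.0], [0.0, 0.0]]))
```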


Example 16.1 (School and Employment Decisions for Young Men): The data 
KEANE.RAW (a subset from Keane and Wolpin, 1997) contains employment and 
schooling history for a sample of men for the years 1981 to 1987. We use the data for 
1987. The three possible outcomes are enrolled in school (status = 0), not in school 
and not working (status = 1), and working (status = 2). The explanatory variables 
are education, a quadratic in past work experience, and a black binary indicator. The 
base category is enrolled in school. Out of 1,717 observations, 99 are enrolled in 
school, 332 are at home, and 1,286 are working. The results are given in Table 16.1. 

Another year of education reduces the log-odds between at home and enrolled in 
school by —.674, and the log-odds between at home and enrolled in school is .813 


Table 16.1 
Multinomial Logit Estimates of School and Labor Market Decisions 


Dependent Variable: status 


Home Work 
Explanatory Variable (status = 1) (status = 2) 
educ —.674 —.315 
(.070) (.065) 
exper —.106 849 
(.173) (.157) 
exper? —.013 —.077 
(.025) (.023) 
black 813 311 
(.303) (.282) 
constant 10.28 5.54 
(1.13) (1.09) 
Number of observations 1,717 
Percent correctly predicted 79.6 
Log-likelihood value —907.86 


Pseudo-R-squared .243 
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higher for black men. The magnitudes of these coefficients are difficult to interpret. 
Instead, we can either compute partial effects, as in equation (16.2), or compute dif- 
ferences in probabilities. For example, consider two black men, each with five years of 
experience. A black man with 16 years of education has an employment probability 
that is .042 higher than a man with 12 years of education, and the at-home proba- 
bility is .072 lower. (Necessarily, the in-school probability is .030 higher for the man 
with 16 years of education.) These results are easily obtained by comparing fitted 
probabilities after multinomial logit estimation. 
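
The comparison of fitted probabilities just described can be sketched in a few lines. The
code below is only an illustration: it plugs the rounded coefficients reported in Table 16.1
into the MNL response probabilities, so the differences it produces should be close to, but
need not exactly match, the figures quoted in the text. The function and variable names are
hypothetical.

```python
import math

# Rounded coefficients from Table 16.1 (base category: enrolled in school, status = 0).
COEF = {
    "home": {"educ": -0.674, "exper": -0.106, "exper2": -0.013, "black": 0.813, "const": 10.28},
    "work": {"educ": -0.315, "exper": 0.849, "exper2": -0.077, "black": 0.311, "const": 5.54},
}

def fitted_probs(educ, exper, black):
    """Return the MNL fitted probabilities for school, home, and work."""
    x = {"educ": educ, "exper": exper, "exper2": exper ** 2, "black": black, "const": 1.0}
    idx_home = sum(COEF["home"][k] * x[k] for k in x)
    idx_work = sum(COEF["work"][k] * x[k] for k in x)
    denom = 1.0 + math.exp(idx_home) + math.exp(idx_work)
    return {
        "school": 1.0 / denom,
        "home": math.exp(idx_home) / denom,
        "work": math.exp(idx_work) / denom,
    }

# Two black men with five years of experience, with 12 and 16 years of education.
p12 = fitted_probs(educ=12, exper=5, black=1)
p16 = fitted_probs(educ=16, exper=5, black=1)
diff = {k: p16[k] - p12[k] for k in p12}
for outcome, change in diff.items():
    print(f"{outcome}: {change:+.3f}")
```

Because the three probabilities sum to one at each education level, the three differences
necessarily sum to zero.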

The experience terms are each insignificant in the home column, but the Wald test
for joint significance of exper and exper² gives p-value = .047, and so they are jointly
significant at the 5 percent level. We would probably leave their coefficients
unrestricted in $\beta_1$, rather than setting them to zero.


The fitted probabilities can be used for prediction purposes: for each observation i, 
the outcome with the highest estimated probability is the predicted outcome. This can 
be used to obtain a percent correctly predicted, by category if desired. For the pre- 
vious example, the overall percent correctly predicted is almost 80 percent, but the 
model does a much better job of predicting that a man is employed (95.2 percent 
correct) than in school (12.1 percent) or at home (39.2 percent). 
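
A minimal sketch of the percent correctly predicted calculation follows. The fitted
probabilities and outcomes are made up for illustration (they are not from KEANE.RAW), and
the function name is hypothetical.

```python
def percent_correctly_predicted(probs, outcomes):
    """Overall and by-category percent correctly predicted.

    probs    -- one list of fitted probabilities per observation
    outcomes -- observed category (0, 1, ..., J) for each observation
    """
    n_cat = len(probs[0])
    hits = [0] * n_cat
    counts = [0] * n_cat
    for p, y in zip(probs, outcomes):
        predicted = max(range(n_cat), key=lambda j: p[j])  # highest fitted probability
        counts[y] += 1
        hits[y] += int(predicted == y)
    overall = 100.0 * sum(hits) / len(outcomes)
    by_category = [100.0 * h / c if c else float("nan") for h, c in zip(hits, counts)]
    return overall, by_category

# Hypothetical fitted probabilities (school, home, work) for five observations.
probs = [
    [0.10, 0.20, 0.70],
    [0.05, 0.55, 0.40],
    [0.40, 0.35, 0.25],
    [0.02, 0.08, 0.90],
    [0.15, 0.30, 0.55],
]
outcomes = [2, 1, 1, 2, 0]
overall, by_category = percent_correctly_predicted(probs, outcomes)
print(overall, by_category)
```

As in the example, a high overall figure can hide poor performance for the less common
outcomes, which is why the by-category breakdown is worth reporting.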


16.2.2 Probabilistic Choice Models 


McFadden (1974) showed that a model closely related to the MNL model can be
obtained from an underlying utility comparison. Suppose that, for a random draw i
from the underlying population (usually, but not necessarily, individuals), the utility
from choosing alternative j is

$$ y_{ij}^* = x_{ij}\beta + a_{ij}, \qquad j = 0, \ldots, J, \tag{16.4} $$

where $a_{ij}$, $j = 0, 1, 2, \ldots, J$, are unobservables affecting tastes. Here, $x_{ij}$ is a
$1 \times K$ vector that differs across alternatives and possibly across individuals as well.
For example, $x_{ij}$ might contain the commute time for individual i using transportation
mode j, or the co-payment required by health insurance plan j (which may or may
not differ by individual). For reasons we will see, $x_{ij}$ cannot contain elements that
vary only across i and not j; in particular, $x_{ij}$ does not contain unity. We assume that
the (J + 1)-vector $a_i$ is independent of $x_i$, which contains $\{x_{ij}: j = 0, \ldots, J\}$.
Let $y_i$ denote the choice of individual i that maximizes utility:

$$ y_i = \operatorname{argmax}(y_{i0}^*, y_{i1}^*, \ldots, y_{iJ}^*), $$
so that $y_i$ takes on a value in $\{0, 1, \ldots, J\}$. As shown by McFadden (1974), if the
$a_{ij}$, $j = 0, \ldots, J$, are independently distributed with cdf $F(a) = \exp[-\exp(-a)]$
(the type I extreme value distribution), then

$$ P(y_i = j \mid x_i) = \exp(x_{ij}\beta) \Big/ \Big[ \sum_{h=0}^{J} \exp(x_{ih}\beta) \Big], \qquad j = 0, \ldots, J. \tag{16.5} $$

The response probabilities in equation (16.5) constitute what is usually called the
conditional logit (CL) model. Dropping the subscript i and differentiating shows that
the marginal effects are given by

$$ \partial p_j(x)/\partial x_{jk} = p_j(x)[1 - p_j(x)]\beta_k, \qquad j = 0, \ldots, J, \; k = 1, \ldots, K, \tag{16.6} $$

and

$$ \partial p_j(x)/\partial x_{hk} = -p_j(x)p_h(x)\beta_k, \qquad j \neq h, \; k = 1, \ldots, K, \tag{16.7} $$

where $p_j(x)$ is the response probability in equation (16.5) and $\beta_k$ is the kth element of
$\beta$. As usual, if the $x_j$ contain nonlinear functions of underlying explanatory variables,
this fact will be reflected in the partial derivatives.
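
As a rough check on equations (16.5) through (16.7), the following sketch computes CL
response probabilities for made-up attribute values and compares the analytical own and
cross marginal effects with finite-difference approximations. The attribute values, the
parameter vector, and all function names are assumptions chosen only for illustration.

```python
import math

def cl_probs(X, beta):
    """Conditional logit probabilities, equation (16.5).

    X holds one attribute vector per alternative; beta is the common parameter vector.
    """
    idx = [sum(b * v for b, v in zip(beta, xj)) for xj in X]
    top = max(idx)  # subtract the largest index for numerical stability
    expo = [math.exp(v - top) for v in idx]
    total = sum(expo)
    return [e / total for e in expo]

def numeric_effect(X, beta, j, h, k, eps=1e-6):
    """Central finite difference of p_j with respect to attribute k of alternative h."""
    up = [row[:] for row in X]
    down = [row[:] for row in X]
    up[h][k] += eps
    down[h][k] -= eps
    return (cl_probs(up, beta)[j] - cl_probs(down, beta)[j]) / (2 * eps)

# Hypothetical attributes (cost, time) for three alternatives and a hypothetical beta.
X = [[1.0, 0.5], [2.0, 0.2], [1.5, 0.9]]
beta = [-0.8, -1.2]
p = cl_probs(X, beta)

j, h, k = 1, 2, 0
own_effect = p[j] * (1.0 - p[j]) * beta[k]   # equation (16.6)
cross_effect = -p[j] * p[h] * beta[k]        # equation (16.7)

print(own_effect, numeric_effect(X, beta, j, j, k))
print(cross_effect, numeric_effect(X, beta, j, h, k))
```

With a negative coefficient on cost, raising the cost of alternative j lowers its own
probability and raises the probability of every other alternative, as the signs in (16.6)
and (16.7) imply.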

The CL and MNL models have similar response probabilities, but they differ in
some important respects. In the MNL model, the conditioning variables do not
change across alternatives: for each i, $x_i$ contains variables specific to the individual
but not to the alternatives. This model is appropriate for problems where characteristics
of the alternatives are unimportant or are not of interest, or where the data are
simply not available. For example, in a model of occupational choice, we do not 
usually know how much someone could make in every occupation. What we can 
usually collect data on are things that affect individual productivity and tastes, such 
as education and past experience. The MNL model allows these characteristics to 
have different effects on the relative probabilities between any two choices. 

The CL model is intended specifically for problems where consumer or firm choices
are at least partly based on observable attributes of each alternative. The utility level
of each choice is assumed to be a linear function in choice attributes, $x_{ij}$, with common
parameter vector $\beta$. This turns out to actually contain the MNL model as a
special case by appropriately choosing $x_{ij}$. Suppose that $w_i$ is a vector of individual
characteristics and that $P(y_i = j \mid w_i)$ follows the MNL in equation (16.1) with
parameters $\delta_j$, $j = 1, \ldots, J$. We can cast this model as the CL model by defining
$x_{ij} = (d1_j w_i, d2_j w_i, \ldots, dJ_j w_i)$, where $dj_h$ is a dummy variable equal to unity when
j = h, and $\beta = (\delta_1', \ldots, \delta_J')'$. Consequently, some authors refer to the CL model as
the MNL model, with the understanding that alternative-specific characteristics are
allowed in the response probability.
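
The construction that turns an MNL model into a CL model can be verified numerically. The
sketch below, with made-up characteristics and parameters, builds the stacked attribute
vectors from alternative dummies interacted with the individual characteristics and checks
that the CL formula reproduces the MNL probabilities. All names and numbers are hypothetical.

```python
import math

def mnl_probs(w, deltas):
    """MNL probabilities with base alternative 0 (its parameter vector is zero)."""
    idx = [0.0] + [sum(d * v for d, v in zip(dj, w)) for dj in deltas]
    expo = [math.exp(v) for v in idx]
    total = sum(expo)
    return [e / total for e in expo]

def cl_probs(X, beta):
    """Conditional logit probabilities with a common parameter vector beta."""
    idx = [sum(b * v for b, v in zip(beta, xj)) for xj in X]
    expo = [math.exp(v) for v in idx]
    total = sum(expo)
    return [e / total for e in expo]

def stack_as_cl(w, J):
    """Build the attribute vector (d1_j * w, ..., dJ_j * w) for each j = 0, ..., J."""
    X = []
    for j in range(J + 1):
        row = []
        for h in range(1, J + 1):
            row.extend(w if h == j else [0.0] * len(w))
        X.append(row)
    return X

w = [1.0, 12.0, 5.0]                            # hypothetical: constant, educ, exper
deltas = [[0.5, -0.1, 0.2], [-1.0, 0.05, 0.3]]  # hypothetical delta_1 and delta_2
beta = [c for dj in deltas for c in dj]         # stacked (delta_1', delta_2')'
X = stack_as_cl(w, J=2)

p_mnl = mnl_probs(w, deltas)
p_cl = cl_probs(X, beta)
print(p_mnl)
print(p_cl)
```

For the base alternative every dummy is zero, so its index is zero, which is the usual MNL
normalization.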

Empirical applications of the CL model often include individual-specific vari- 
ables by allowing them to have separate effects on the latent utilities. A general model 
is 


$$ y_{ij}^* = z_{ij}\gamma + w_i\delta_j + a_{ij}, \qquad j = 0, 1, \ldots, J, \tag{16.8} $$

with $\delta_0 = 0$ as a normalization, where $z_{ij}$ varies across j and possibly i. If $\delta_j = \delta$ for
all j, then $w_i\delta$ drops out of all response probabilities. The model with both kinds of
explanatory variables in (16.8) is called the mixed logit model.

The CL model is very convenient for modeling probabilistic choice, but it has some
limitations. An important restriction is

$$ p_j(x_j)/p_h(x_h) = \exp(x_j\beta)/\exp(x_h\beta) = \exp[(x_j - x_h)\beta], \tag{16.9} $$

so that relative probabilities for any two alternatives depend only on the attributes of
those two alternatives. This is called the independence from irrelevant alternatives
(IIA) assumption because it implies that adding another alternative or changing the
characteristics of a third alternative does not affect the relative odds between
alternatives j and h. This implication is often implausible, especially for applications with
similar alternatives. A well-known example is due to McFadden (1974). Consider
commuters initially choosing between two modes of transportation, car and red bus.
Suppose that a consumer chooses between these two modes with equal probability, .5, so
that the ratio in equation (16.9) is unity. Now suppose a third mode, blue bus, is
added. Assuming bus commuters do not care about the color of the bus, consumers
will choose between the two buses with equal probability. But then IIA implies that the
probability of each mode is 1/3; therefore, the fraction of commuters taking a car
would fall from 1/2 to 1/3, a result that is not very realistic. This example is admittedly
extreme (in practice, we would lump the blue bus and red bus into the same category,
provided there are no other differences), but it indicates that the IIA property
can impose unwanted restrictions in the conditional logit model.
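
The red bus and blue bus arithmetic can be reproduced directly from the CL formula. In this
small sketch all alternatives are assigned the same index (zero), which is an assumption
made purely to mimic the equal-probability setup of the example.

```python
import math

def cl_probs(indices):
    """Conditional logit probabilities from the linear indices of each alternative."""
    expo = [math.exp(v) for v in indices]
    total = sum(expo)
    return [e / total for e in expo]

before = cl_probs([0.0, 0.0])        # car, red bus
after = cl_probs([0.0, 0.0, 0.0])    # car, red bus, blue bus

print("car share before:", before[0])
print("car share after: ", after[0])
print("car/red bus odds before and after:", before[0] / before[1], after[0] / after[1])
```

The car-to-red-bus odds are unchanged by the new alternative, as IIA requires, and that is
exactly what forces the car share down to one-third.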

Hausman and McFadden (1984) offer tests of the IIA assumption based on
the observation that, if the CL model is true, $\beta$ can be consistently estimated by
conditional logit by focusing on any subset of alternatives. They apply the Hausman
principle, which compares the estimate of $\beta$ using all alternatives to the estimate
using a subset of alternatives.

Several models that relax the IIA assumption have been suggested. In the context
of the random utility model, the IIA assumption comes about because the
$\{a_{ij}: j = 0, 1, \ldots, J\}$ are assumed to be independent Weibull random variables. A more
flexible assumption is that $a_i$ has a multivariate normal distribution with arbitrary
correlations between $a_{ij}$ and $a_{ih}$, all $j \neq h$. The resulting model is called the multinomial
probit model. (In keeping with the spirit of the previous names, conditional probit
model is a better name, which is used by Hausman and Wise (1978) but not by many
others.)

Theoretically, the multinomial probit model is attractive, but it has some practical 
limitations. The response probabilities are very complicated, involving a (J + 1)- 
dimensional integral. This complexity not only makes it difficult to obtain the partial 
effects on the response probabilities, it also makes maximum likelihood estimation 
(MLE) infeasible for more than about five alternatives. For details, see Maddala 
(1983, Chap. 3) and Amemiya (1985, Chap. 9). Hausman and Wise (1978) contain an 
application to transportation mode for three alternatives. 

Recent advances in estimation through simulation make multinomial probit
estimation feasible for many alternatives. See Hajivassiliou and Ruud (1994) and Keane
(1993) for recent surveys of simulation estimation. Keane and Moffitt (1998) apply
simulation methods to structural multinomial response models, where the econometric
model is obtained from utility maximization subject to constraints. Keane and Moffitt
study the effects of taxes on labor force participation, allowing for participation in
multiple welfare programs.

A different approach to relaxing IIA is to specify a hierarchical model. The most 
popular of these is called the nested logit model. McFadden (1984) gives a detailed 
treatment of these and other models; here we illustrate the basic approach where 
there are only two hierarchies. 

Suppose that the total number of alternatives can be put into S groups of similar
alternatives, and let $G_s$ denote the alternatives within group s. Thus the first hierarchy
corresponds to which of the S groups y falls into, and the second corresponds to the
actual alternative within each group. McFadden (1981) studied the model

$$ P(y \in G_s \mid x) = \alpha_s \Big[ \sum_{j \in G_s} \exp(\rho_s^{-1} x_j\beta) \Big]^{\rho_s} \Big/ \Big\{ \sum_{r=1}^{S} \alpha_r \Big[ \sum_{j \in G_r} \exp(\rho_r^{-1} x_j\beta) \Big]^{\rho_r} \Big\} \tag{16.10} $$

and

$$ P(y = j \mid y \in G_s, x) = \exp(\rho_s^{-1} x_j\beta) \Big/ \Big[ \sum_{h \in G_s} \exp(\rho_s^{-1} x_h\beta) \Big], \tag{16.11} $$

where equation (16.10) is defined for $s = 1, 2, \ldots, S$ while equation (16.11) is defined
for $j \in G_s$ and $s = 1, 2, \ldots, S$; of course, if $j \notin G_s$, $P(y = j \mid y \in G_s, x) = 0$. This model
requires a normalization restriction, usually $\alpha_1 = 1$. Equation (16.10) gives the
probability that the outcome is in group s (conditional on x); then, conditional on $y \in G_s$,
equation (16.11) gives the probability of choosing alternative j within $G_s$. The
response probability $P(y = j \mid x)$, which is ultimately of interest, is obtained by
multiplying equations (16.10) and (16.11). This model can be derived by specifying a
particular joint distribution for $a_i$ in equation (16.4); see Amemiya (1985, p. 303).
Equation (16.11) implies that, conditional on choosing group s, the response
probabilities take a CL form with parameter vector $\rho_s^{-1}\beta$. This suggests a natural
two-step estimation procedure. First, estimate $\lambda_s = \rho_s^{-1}\beta$, $s = 1, 2, \ldots, S$, by applying
CL analysis separately to each of the groups. Then, plug the $\hat\lambda_s$ into equation (16.10)
and estimate $\alpha_s$, $s = 2, \ldots, S$, and $\rho_s$, $s = 1, \ldots, S$, by maximizing the log-likelihood
function

$$ \sum_{i=1}^{N} \sum_{s=1}^{S} 1[y_i \in G_s] \log[q_s(x_i; \hat\lambda, \alpha, \rho)], $$

where $q_s(x; \lambda, \alpha, \rho)$ is the probability in equation (16.10) with $\lambda_s = \rho_s^{-1}\beta$. This
two-step conditional MLE is consistent and $\sqrt{N}$-asymptotically normal under general
conditions, but the asymptotic variance needs to be adjusted for the first-stage
estimation of the $\lambda_s$; see Chapters 12 and 13 for more on two-step estimators.

Of course, we can also use full MLE. The log likelihood for observation i can be
written as

$$ \ell_i(\beta, \alpha, \rho) = \sum_{s=1}^{S} \Big( 1[y_i \in G_s] \Big\{ \log[q_s(x_i; \beta, \alpha, \rho)] + \sum_{j \in G_s} 1[y_i = j] \log[p_{sj}(x_i; \beta, \rho_s)] \Big\} \Big), \tag{16.12} $$

where $q_s(x_i; \beta, \alpha, \rho)$ is the probability in equation (16.10) and $p_{sj}(x_i; \beta, \rho_s)$ is the
probability in equation (16.11). The regularity conditions for MLE are satisfied under
weak assumptions.

When $\alpha_s = 1$ and $\rho_s = 1$ for all s, the nested logit model reduces to the CL model.
Thus, a test of IIA (as well as the other assumptions underlying the CL model) is
a test of $H_0: \alpha_2 = \cdots = \alpha_S = \rho_1 = \cdots = \rho_S = 1$. McFadden (1987) suggests a score
test, which only requires estimation of the CL model.
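
To make the two-level structure concrete, the sketch below multiplies the group probability
(16.10) by the within-group probability (16.11) and confirms numerically that setting every
$\alpha_s$ and $\rho_s$ to one gives back the CL probabilities. The indices, grouping, and
parameter values are hypothetical.

```python
import math

def nested_logit_probs(indices, groups, alpha, rho):
    """P(y = j | x) for a two-level nested logit.

    indices -- list with indices[j] = x_j * beta for each alternative j
    groups  -- list of lists; groups[s] holds the alternatives in group s
    alpha, rho -- group-level parameters, one per group
    """
    inclusive = [sum(math.exp(indices[j] / rho[s]) for j in g)
                 for s, g in enumerate(groups)]
    weights = [alpha[s] * inclusive[s] ** rho[s] for s in range(len(groups))]
    total = sum(weights)
    probs = [0.0] * len(indices)
    for s, g in enumerate(groups):
        p_group = weights[s] / total                                 # equation (16.10)
        for j in g:
            p_within = math.exp(indices[j] / rho[s]) / inclusive[s]  # equation (16.11)
            probs[j] = p_group * p_within
    return probs

indices = [0.3, -0.2, 0.5]   # hypothetical x_j * beta for three alternatives
groups = [[0], [1, 2]]       # alternative 0 alone; alternatives 1 and 2 nested together

p_nested = nested_logit_probs(indices, groups, alpha=[1.0, 1.4], rho=[1.0, 0.5])
p_restricted = nested_logit_probs(indices, groups, alpha=[1.0, 1.0], rho=[1.0, 1.0])

denom = sum(math.exp(v) for v in indices)
p_cl = [math.exp(v) / denom for v in indices]
print(p_nested)
print(p_restricted, p_cl)
```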

Often special cases of the model are used, such as setting each $\alpha_s$ to unity and
estimating the $\rho_s$. In his study of community choice and type of dwelling within a
community, McFadden (1978) imposes this restriction along with $\rho_s = \rho$ for all s, so
that the model has only one more parameter than the CL model. This approach
allows for correlation among the $a_j$ for j belonging to the same community group,
but the correlation is assumed to be the same for all communities.
The nested logit model allows different explanatory variables, which can also differ
by alternative, to appear in the different levels. For example, one set of variables may 
be relevant for community choice, such as quality of schools, crime rates, property 
tax rates, and other measures of community quality. Another set of variables would 
affect the kind of dwelling, such as the prices of the different kinds of dwellings. 
Higher-level nested logit models are covered in McFadden (1984) and Amemiya 
(1985, Chap. 9). 

A different approach to modifying CL to arrive at a model that relaxes IIA is to
include unobserved heterogeneity in the model (almost always assumed independent
of the covariates in cross section applications) and then integrate it out. If $c_i$
represents scalar heterogeneity, we can extend equation (16.8) to

$$ y_{ij}^* = x_{ij}\beta + c_i + a_{ij}, \qquad j = 0, 1, \ldots, J. \tag{16.13} $$

If $\{a_{ij}: j = 0, 1, \ldots, J\}$, conditional on $(x_i, c_i)$, are independent, identically
distributed (i.i.d.) with the type I extreme value distribution, then the presence of $c_i$
allows correlation in the utilities of the different choices conditional only on $x_i$.
Typically, $c_i$ is assumed independent of $x_i$ with a normal distribution. If $c_i$ has a discrete
distribution with a known number of support points, then the resulting model is
usually called a mixture model or latent class model, the idea being that each cross
section unit i belongs to an unobserved, or latent, class. Swait (2003) provides a recent
example, where $(a_{i0}, \ldots, a_{iJ})$ has the generalized extreme value distribution,
which contains the CL model and nested logit model as special cases. In addition to
discussing parameter estimates, Swait shows how to obtain partial effects after
averaging out the heterogeneity.

We can go even further by replacing $\beta$ in equation (16.13) with a random
coefficient, say $b_i$, where $E(b_i) = \beta$. In fact, $b_i$ is typically assumed to be independent of $x_i$
with a multivariate normal distribution. Such models can add considerable flexibility
to the standard models, but they are more difficult to estimate. McFadden and Train
(2000) provide a comprehensive treatment, and Cameron and Trivedi (2005) provide
an overview and additional references.


16.2.3 Endogenous Explanatory Variables 


Because explanatory variables can be correlated with unobservables, it is natural to
study multinomial response models with endogenous explanatory variables. Rather
than start with a linear index with an additive error, as in the derivation of the CL
model, we can start with a model for $P(y_{i1} = j \mid z_{i1}, y_{i2}, r_{i1})$, $j = 0, 1, \ldots, J$, where $r_{i1}$
represents unobserved, omitted factors thought to be correlated with the vector $y_{i2}$.
(The vector $z_{i1}$, and even $y_{i2}$, can contain variables that change with j, as well as
those that change only with i. That is, the exogenous and endogenous variables can
be specific to the alternative.) Then, we would use a reduced form for $y_{i2}$, say
$y_{i2} = z_i\Pi_2 + v_{i2}$, assume that $(r_{i1}, v_{i2})$ is independent of $z_i$, and that $(r_{i1}, v_{i2})$ has a
convenient distribution, such as multivariate normal. This approach is similar to the
control function approach for probit we covered in Section 15.7.2.

In fact, the approach described in the previous paragraph has been applied when
the response probabilities have the CL form. For example, Villas-Boas and Winer
(1999) apply this approach to modeling brand choice, where prices are allowed to
correlate with unobserved tastes that affect brand choice. The problem in starting
with an MNL or CL model for $P(y_{i1} = j \mid z_{i1}, y_{i2}, r_{i1})$ in implementing the control
function approach is computational: just as the binary logit model does not mix well
with normally distributed unobservables, neither does the CL model. Nevertheless,
estimation is possible, particularly if one uses the simulation methods of estimation briefly
mentioned in the previous subsection.

A much simpler control function approach is obtained if we skip the step of modeling
$P(y_{i1} = j \mid z_{i1}, y_{i2}, r_{i1})$ and jump directly to convenient models for
$P(y_{i1} = j \mid z_i, y_{i2}, v_{i2}) = P(y_{i1} = j \mid z_{i1}, y_{i2}, v_{i2})$. Kuksov and Villas-Boas (2008) and
Petrin and Train (2010) are proponents of this solution. The idea is that any parametric model for
$P(y_{i1} = j \mid z_{i1}, y_{i2}, r_{i1})$ is essentially arbitrary, so, if we can recover quantities of
interest directly from $P(y_{i1} = j \mid z_{i1}, y_{i2}, v_{i2})$, why not specify these probabilities
directly? If we assume that $D(r_{i1} \mid z_i, y_{i2}) = D(r_{i1} \mid v_{i2})$, and that $P(y_{i1} = j \mid z_{i1}, y_{i2}, v_{i2})$
can be obtained from $P(y_{i1} = j \mid z_{i1}, y_{i2}, r_{i1})$ by integrating the latter with respect to
$D(r_{i1} \mid v_{i2})$, then we can apply the results on average partial effects (APEs) that we
used in Section 15.7.2 (because $P(y_{i1} = j \mid z_{i1}, y_{i2}, r_{i1}) = E(1[y_{i1} = j] \mid z_{i1}, y_{i2}, r_{i1})$ is a
conditional expectation). The weakness of this approach is that it implicitly maintains
that, say, an MNL model for $P(y_{i1} = j \mid z_{i1}, y_{i2}, v_{i2})$ is consistent with underlying
specifications for $P(y_{i1} = j \mid z_{i1}, y_{i2}, r_{i1})$ and $D(r_{i1} \mid v_{i2})$, and those underlying
specifications would not have recognizable forms.

Once we have selected a model for $P(y_{i1} = j \mid z_{i1}, y_{i2}, v_{i2})$, which could be CL or
nested logit, we can apply a simple two-step procedure. First, we estimate the reduced
form for $y_{i2}$ and obtain the residuals, $\hat v_{i2} = y_{i2} - z_i\hat\Pi_2$. (Alternatively, we can use
strictly monotonic transformations of the elements of $y_{i2}$, such as the logarithm if
$y_{i2g} > 0$ or $\log[y_{i2g}/(1 - y_{i2g})]$ if $0 < y_{i2g} < 1$, as dependent variables in the reduced
forms. As we discussed in the binary response case, using transformations of continuous
endogenous explanatory variables allows for more realistic linear reduced forms
with additive, independent errors.) Then, we estimate one of the multinomial response
models we just covered with explanatory variables $z_{i1}$, $y_{i2}$, and $\hat v_{i2}$. As always
with control function approaches, we need enough exclusion restrictions in $z_{i1}$ to
identify the parameters and APEs. We can include nonlinear functions of $(z_{i1}, y_{i2}, \hat v_{i2})$,
including quadratics and interactions.

Given estimates of the probabilities $P(y_{i1} = j \mid z_{i1} = z_1, y_{i2} = y_2, v_{i2} = v_2) =
p_j(z_1, y_2, v_2)$, we can estimate the APEs on the structural probabilities by estimating
the average structural function (ASF):

$$ \widehat{ASF}(z_1, y_2) = N^{-1} \sum_{i=1}^{N} \hat p_j(z_1, y_2, \hat v_{i2}). \tag{16.14} $$

Then, we can take derivatives or changes of $\widehat{ASF}(z_1, y_2)$ with respect to elements of
$(z_1, y_2)$, as usual. While the delta method can be used to obtain analytical standard
errors, the bootstrap is simpler and feasible if one uses, say, CL.
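
The averaging in equation (16.14) is simple to carry out once the second-step estimates and
first-stage residuals are in hand. The following sketch assumes, purely for illustration, an
MNL form for the second-step probabilities with scalar $z_1$, $y_2$, and $v_2$; the
coefficient values and residuals are invented and do not come from any application.

```python
import math

def mnl_probs(index_by_alt):
    """MNL probabilities for j = 0, ..., J; the base alternative has index zero."""
    expo = [1.0] + [math.exp(v) for v in index_by_alt]
    total = sum(expo)
    return [e / total for e in expo]

def asf(z1, y2, v2_hat, coefs):
    """Estimated average structural function: average over the first-stage residuals."""
    n_alt = len(coefs) + 1
    totals = [0.0] * n_alt
    for v in v2_hat:
        idx = [c["const"] + c["z1"] * z1 + c["y2"] * y2 + c["v2"] * v for c in coefs]
        for j, p in enumerate(mnl_probs(idx)):
            totals[j] += p
    return [t / len(v2_hat) for t in totals]

# Hypothetical second-step estimates for alternatives 1 and 2 (alternative 0 is the base).
coefs = [
    {"const": 0.2, "z1": 0.5, "y2": -0.4, "v2": 0.3},
    {"const": -0.1, "z1": 0.1, "y2": 0.6, "v2": -0.5},
]
v2_hat = [-1.2, -0.3, 0.0, 0.4, 1.1]   # hypothetical first-stage residuals

asf_at_point = asf(1.0, 2.0, v2_hat, coefs)

# Approximate APE of y2 on each probability by a small symmetric change in y2.
step = 0.01
ape_y2 = [(hi - lo) / (2 * step)
          for hi, lo in zip(asf(1.0, 2.0 + step, v2_hat, coefs),
                            asf(1.0, 2.0 - step, v2_hat, coefs))]
print(asf_at_point)
print(ape_y2)
```

Note that the residuals are held at their estimated values while $(z_1, y_2)$ is set to the
point of interest; that is what distinguishes the ASF from a fitted probability for a
particular observation.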

If we adopt a multinomial probit model as the underlying structural model, then 
we can derive a multinomial probit model in implementing the control function 
approach. But, as in the case with exogenous explanatory variables, such an 
approach is computationally much more difficult. Implementing the bootstrap to 
obtain standard errors for the parameters or APEs would create a computational 
challenge, even with a small number of alternatives. 


16.2.4 Panel Data Methods 


We can use reasoning similar to that in the previous subsection to obtain simple 
strategies for allowing correlated random effects in panel data models. Ideally, we 
would specify a popular model for the response probabilities conditional on un- 
observed heterogeneity. However, for the same reasons discussed in Section 16.2.3, 
such an approach leads to computational difficulties. 

To be more precise about the issues, for each time period t, let $y_{it}$ be an unordered
response taking values in $\{0, \ldots, J\}$, and let $x_{it}$ denote a vector of explanatory
variables (that includes a constant and can, as usual, include a full set of time dummies
and even time-constant variables). Letting $c_i$ be a vector of unobserved heterogeneity,
we would like to start by specifying

$$ P(y_{it} = j \mid x_{it}, c_i) = P(y_{it} = j \mid x_i, c_i), \tag{16.15} $$

where the equality of the two probabilities means that we assume strict exogeneity of
$\{x_{it}: t = 1, \ldots, T\}$ conditional on $c_i$. If we specify equation (16.15) as, say, an MNL,
then integrating out $c_i$, after specifying $D(c_i \mid x_i)$, is typically nontrivial. For example,
suppose we start with an MNL model with a single source of heterogeneity,

$$ P(y_{it} = j \mid x_{it}, c_i) = \exp(x_{it}\beta_j + \alpha_j c_i) \Big/ \Big[ 1 + \sum_{h=1}^{J} \exp(x_{it}\beta_h + \alpha_h c_i) \Big]. \tag{16.16} $$

If we specify

$$ c_i \mid x_i \sim \text{Normal}(\psi + \bar w_i\xi, \sigma^2), \tag{16.17} $$

where $w_{it}$ is the subset of $x_{it}$ that changes across i and t, then, in principle, we can
obtain $P(y_{it} = j \mid x_i) = P(y_{it} = j \mid x_{it}, \bar w_i)$. But the result is not in closed form and not
easy to deal with computationally. Simulation methods of estimation of the kind
mentioned in Section 16.2.2 are usually needed.

Instead, if we assume only that $D(c_i \mid x_i) = D(c_i \mid \bar w_i)$, then

$$ P(y_{it} = j \mid x_i) = P(y_{it} = j \mid x_{it}, \bar w_i), \qquad j = 0, \ldots, J; \; t = 1, \ldots, T. \tag{16.18} $$

Further, we know that the APEs of $P(y_{it} = j \mid x_{it} = x_t, c_i)$ (that is, the partial effects
with $c_i$ averaged out) can be obtained by averaging out $\bar w_i$ in $P(y_{it} = j \mid x_{it} = x_t, \bar w_i)$.
We used this fact several times in Section 15.8; see also Section 2.2.5. Therefore, it
may be useful to directly specify models for $P(y_{it} = j \mid x_{it}, \bar w_i)$. A simple approach is
to specify $P(y_{it} = j \mid x_{it}, \bar w_i)$ as, say, MNL:

$$ y_{it} \mid (x_{i1}, \ldots, x_{iT}) \sim \text{Multinomial}(x_{it}\beta_1 + \bar w_i\xi_1, \ldots, x_{it}\beta_J + \bar w_i\xi_J), \tag{16.19} $$

where $x_{it}$ is assumed to include an intercept (and probably T - 1 time period dummies).
We can then just apply pooled multinomial logit estimation and include the
time averages, $\bar w_i$, as additional explanatory variables. Inference should be made
robust to arbitrary serial dependence because, at a minimum, part of the heterogeneity
is still omitted. Plus, the pooled MNL estimator allows any kind of dependence
in $D(y_{i1}, \ldots, y_{iT} \mid x_i, c_i)$. Of course, this approach imposes IIA conditional on
$(x_{it}, \bar w_i)$, but, interestingly, it does not impose IIA on the average response
probabilities (the average structural function, in this case), which we would estimate as

$$ \widehat{ASF}_j(x_t) = N^{-1} \sum_{i=1}^{N} \Big\{ \exp(x_t\hat\beta_j + \bar w_i\hat\xi_j) \Big/ \Big[ 1 + \sum_{h=1}^{J} \exp(x_t\hat\beta_h + \bar w_i\hat\xi_h) \Big] \Big\}. \tag{16.20} $$


Allowing a mixed model, where some elements of $x_{it}$ are choice specific, causes no
difficulties. We can use the same strategy applied to nested logit models: we simply
add the time averages, $\bar w_i$, as additional explanatory variables, use pooled MLE,
and make inference robust to arbitrary serial dependence. Many econometrics packages
make this approach very straightforward to implement.

We gain even more flexibility by allowing $x_{it}$ and $\bar w_i$ to interact in the MNL or CL
model. In other words, we can add $[\text{vec}(\bar w_i \otimes x_{it})]'\lambda_j$ to the probabilities implicit in
equation (16.19). Other nonlinear functions, such as general polynomials, are easy to
include, too.
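
The claim that pooled MNL with time averages imposes IIA conditionally but not on the
average structural function can be illustrated numerically. In the sketch below, with scalar
$x_t$ and $\bar w_i$ and invented parameter values, changing only the parameters of
alternative 2 leaves the conditional odds of alternative 1 relative to the base unchanged,
while the corresponding ratio of ASFs moves.

```python
import math

def mnl(x, wbar, beta, xi):
    """P(y = j | x, wbar), j = 0, ..., J, with scalar x and wbar; alternative 0 is the base."""
    expo = [1.0] + [math.exp(b * x + g * wbar) for b, g in zip(beta, xi)]
    total = sum(expo)
    return [e / total for e in expo]

def asf(x, wbars, beta, xi):
    """Average the MNL probabilities over the sample of time averages, as in (16.20)."""
    n_alt = len(beta) + 1
    return [sum(mnl(x, w, beta, xi)[j] for w in wbars) / len(wbars) for j in range(n_alt)]

wbars = [-1.0, 0.0, 0.5, 2.0]   # hypothetical time averages for four units
xi = [0.8, -0.6]                # hypothetical coefficients on the time averages
x = 1.0

beta_a = [0.5, 0.2]             # hypothetical slopes for alternatives 1 and 2
beta_b = [0.5, 1.5]             # same slope for alternative 1, different for alternative 2

# Conditional odds of alternative 1 versus the base, at a fixed value of wbar.
pa = mnl(x, 0.5, beta_a, xi)
pb = mnl(x, 0.5, beta_b, xi)
cond_odds_a, cond_odds_b = pa[1] / pa[0], pb[1] / pb[0]

# The same ratio computed from the average structural functions.
asf_a, asf_b = asf(x, wbars, beta_a, xi), asf(x, wbars, beta_b, xi)
asf_odds_a, asf_odds_b = asf_a[1] / asf_a[0], asf_b[1] / asf_b[0]

print(cond_odds_a, cond_odds_b)
print(asf_odds_a, asf_odds_b)
```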
16.3 Ordered Response Models


16.3.1 Ordered Logit and Ordered Probit 


We now turn to models for ordered responses. Let y be an ordered response taking
on the values $\{0, 1, 2, \ldots, J\}$ for some known integer J. The ordered probit model for
y (conditional on explanatory variables x) can be derived from a latent variable
model. Assume that a latent variable $y^*$ is determined by

$$ y^* = x\beta + e, \qquad e \mid x \sim \text{Normal}(0, 1), \tag{16.21} $$

where $\beta$ is $K \times 1$ and, for reasons to be seen, x does not contain a constant. Let
$\alpha_1 < \alpha_2 < \cdots < \alpha_J$ be unknown cut points (or threshold parameters), and define

$$
\begin{aligned}
y &= 0 \quad \text{if } y^* \le \alpha_1 \\
y &= 1 \quad \text{if } \alpha_1 < y^* \le \alpha_2 \\
&\;\;\vdots \\
y &= J \quad \text{if } y^* > \alpha_J
\end{aligned}
\tag{16.22}
$$

For example, if y takes on the values 0, 1, and 2, then there are two cut points, $\alpha_1$
and $\alpha_2$.

Given the standard normal assumption for e, it is straightforward to derive the
conditional distribution of y given x; we simply compute each response probability:

$$
\begin{aligned}
P(y = 0 \mid x) &= P(y^* \le \alpha_1 \mid x) = P(x\beta + e \le \alpha_1 \mid x) = \Phi(\alpha_1 - x\beta) \\
P(y = 1 \mid x) &= P(\alpha_1 < y^* \le \alpha_2 \mid x) = \Phi(\alpha_2 - x\beta) - \Phi(\alpha_1 - x\beta) \\
&\;\;\vdots \\
P(y = J - 1 \mid x) &= P(\alpha_{J-1} < y^* \le \alpha_J \mid x) = \Phi(\alpha_J - x\beta) - \Phi(\alpha_{J-1} - x\beta) \\
P(y = J \mid x) &= P(y^* > \alpha_J \mid x) = 1 - \Phi(\alpha_J - x\beta).
\end{aligned}
\tag{16.23}
$$

You can easily verify that these sum to unity. When J = 1 we get the binary probit
model: $P(y = 1 \mid x) = 1 - P(y = 0 \mid x) = 1 - \Phi(\alpha_1 - x\beta) = \Phi(x\beta - \alpha_1)$, and so $-\alpha_1$
is the intercept inside $\Phi$. It is for this reason that x does not contain an intercept in
this formulation of the ordered probit model. (When there are only two outcomes,
zero and one, we set the single cut point to zero and estimate the intercept; this
approach leads to the standard probit model.)
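
The response probabilities in (16.23) are differences of the standard normal cdf evaluated
at adjacent cut points, which the short sketch below computes. The index value and cut
points are hypothetical, and the function names are chosen only for this illustration.

```python
import math

def Phi(z):
    """Standard normal cdf."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def ordered_probit_probs(xb, cuts):
    """Return [P(y=0|x), ..., P(y=J|x)] for index xb and increasing cut points."""
    cdf = [Phi(a - xb) for a in cuts]
    bounds = [0.0] + cdf + [1.0]
    return [bounds[j + 1] - bounds[j] for j in range(len(cuts) + 1)]

# Three outcomes (J = 2) with hypothetical cut points and index.
probs = ordered_probit_probs(xb=0.4, cuts=[-0.5, 1.0])
print(probs, sum(probs))

# With a single cut point (J = 1) the model is binary probit with intercept -alpha_1.
alpha_1, xb = 0.3, 0.4
binary = ordered_probit_probs(xb, [alpha_1])
print(binary[1], Phi(xb - alpha_1))
```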
The parameters $\alpha$ and $\beta$ can be estimated by MLE. For each i, the log-likelihood
function is

$$
\begin{aligned}
\ell_i(\alpha, \beta) = {} & 1[y_i = 0] \log[\Phi(\alpha_1 - x_i\beta)] + 1[y_i = 1] \log[\Phi(\alpha_2 - x_i\beta) - \Phi(\alpha_1 - x_i\beta)] \\
& + \cdots + 1[y_i = J] \log[1 - \Phi(\alpha_J - x_i\beta)].
\end{aligned}
\tag{16.24}
$$

This log-likelihood function is well behaved, and many statistical packages routinely
estimate ordered probit models.

Other distribution functions can be used in place of $\Phi$. Replacing $\Phi$ with the logit
function, $\Lambda$, gives the ordered logit model. In either case we must remember that $\beta$, by
itself, is of limited interest. In most cases we are not interested in $E(y^* \mid x) = x\beta$, as
$y^*$ is an abstract construct. Instead, we are interested in the response probabilities
$P(y = j \mid x)$, just as in the unordered response case. For the ordered probit model

$$ \partial p_0(x)/\partial x_k = -\beta_k\phi(\alpha_1 - x\beta), \qquad \partial p_J(x)/\partial x_k = \beta_k\phi(\alpha_J - x\beta), $$

$$ \partial p_j(x)/\partial x_k = \beta_k[\phi(\alpha_j - x\beta) - \phi(\alpha_{j+1} - x\beta)], \qquad 0 < j < J, $$

and the formulas for the ordered logit model are similar. In making comparisons
across different models (in particular, comparing ordered probit and ordered logit)
we must remember to compare estimated response probabilities at various values of
x, such as $\bar x$; the $\hat\beta$ are not directly comparable. In particular, the $\hat\alpha_j$ are important
determinants of the magnitudes of the estimated probabilities and partial effects.
(Therefore, treatments of ordered probit that refer to the $\alpha_j$ as ancillary, or secondary,
parameters are misleading.)

While the direction of the effect of $x_k$ on the probabilities $P(y = 0 \mid x)$ and
$P(y = J \mid x)$ is unambiguously determined by the sign of $\beta_k$, the sign of $\beta_k$ does not
always determine the direction of the effect for the intermediate outcomes, $1, 2, \ldots,
J - 1$. To see this point, suppose there are three possible outcomes, 0, 1, and 2, and
that $\beta_k > 0$. Then $\partial p_0(x)/\partial x_k < 0$ and $\partial p_2(x)/\partial x_k > 0$, but $\partial p_1(x)/\partial x_k$ could be either
sign. If $|\alpha_1 - x\beta| < |\alpha_2 - x\beta|$, the scale factor, $\phi(\alpha_1 - x\beta) - \phi(\alpha_2 - x\beta)$, is positive;
otherwise it is negative. (This conclusion follows because the standard normal pdf is
symmetric about zero, reaches its maximum at zero, and declines monotonically as
its argument increases in absolute value.)
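
The sign ambiguity for the middle outcome is easy to see numerically. The sketch below
evaluates the ordered probit partial effects for three outcomes at two hypothetical values
of the index $x\beta$, using invented cut points and a positive coefficient.

```python
import math

def phi(z):
    """Standard normal pdf."""
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)

def partial_effects(xb, cuts, beta_k):
    """Partial effects of a continuous x_k on P(y = j | x), j = 0, ..., J."""
    dens = [phi(a - xb) for a in cuts]
    padded = [0.0] + dens + [0.0]   # zero density beyond the lowest and highest cut points
    return [beta_k * (padded[j] - padded[j + 1]) for j in range(len(cuts) + 1)]

cuts = [-0.5, 1.0]   # hypothetical alpha_1 and alpha_2
beta_k = 0.7         # hypothetical positive coefficient

# Here |alpha_1 - xb| = 0.5 < |alpha_2 - xb| = 1.0, so the middle effect is positive.
pe_low = partial_effects(0.0, cuts, beta_k)
# Here |alpha_1 - xb| = 1.5 > |alpha_2 - xb| = 0.0, so the middle effect is negative.
pe_high = partial_effects(1.0, cuts, beta_k)

print(pe_low)
print(pe_high)
```

In both cases the effects on the lowest and highest outcomes have the signs dictated by the
coefficient, and the three effects sum to zero because the probabilities sum to one.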

As with multinomial logit, for ordered responses we can compute the percent cor- 
rectly predicted, for each outcome as well as overall: our prediction for y is simply 
the outcome with the highest estimated probability. 

Ordered probit and logit can also be applied when y is given quantitative meaning 
but we wish to acknowledge the discrete, ordered nature of the response. For exam- 
ple, suppose that individuals are asked to give one of three responses on how their 
pension funds are invested: “mostly bonds,” “mixed,” and “mostly stocks.” One 
possibility is to assign these outcomes as 0, 1, 2 and apply ordered probit or ordered
logit to estimate the effects of various factors on the probability of each outcome.
Instead, we could assign the percent invested in stocks as, say, 0, 50, and 100, or 25,
50, and 75. For estimating the probabilities of each category it is irrelevant how we
assign the percentages as long as the order is preserved. However, if we give
quantitative meaning to y, the expected value of y has meaning. We have

$$ E(y \mid x) = a_0 P(y = a_0 \mid x) + a_1 P(y = a_1 \mid x) + \cdots + a_J P(y = a_J \mid x), \tag{16.25} $$

where $a_0 < a_1 < \cdots < a_J$ are the J + 1 values taken on by y. Once we have estimated
the response probabilities by ordered probit or ordered logit, we can easily estimate
$E(y \mid x)$ for any value of x, for example, $\bar x$. Estimates of the expected values can be
compared at different values of the explanatory variables to obtain partial effects for
discrete $x_j$. Alternatively, we can compute average partial effects by averaging the
partial derivatives or discrete changes across i.


Example 16.2 (Asset Allocation in Pension Plans): The data in PENSION.RAW 
are a subset of data used by Papke (1998) in assessing the impact of allowing indi- 
viduals to choose their own allocations on asset allocation in pension plans. Initially, 
Papke codes the responses “mostly bonds,” “mixed,” and “mostly stocks” as 0, 50, 
and 100, and uses a linear regression model estimated by OLS. The binary explana- 
tory variable choice is unity if the person has choice in how his or her pension fund is 
invested. Controlling for age, education, gender, race, marital status, income (via a set 
of dummy variables), wealth, and whether the plan is profit sharing gives the OLS 
estimate $\hat\beta_{choice} = 12.05$ (se = 6.30), where N = 194. This result means that, other
things equal, a person having choice has about 12 percentage points more assets in
stocks.

The ordered probit coefficient on choice is .371 (se = .184). The magnitude of the
ordered probit coefficient does not have a simple interpretation, but its sign and
statistical significance agree with the linear regression results. (The estimated cut points
are $\hat\alpha_1 = -3.087$ and $\hat\alpha_2 = -2.054$.) To get an idea of the magnitude of the estimated
effect of choice on the expected percent in stocks, we can estimate $E(y \mid x)$ with
choice = 1 and choice = 0, and obtain the difference. However, we need to choose
values for the other regressors. For illustration, suppose the person is 60 years old,
has 13.5 years of education (roughly the averages in the sample), is a single, nonblack
male, has annual income between $50,000 and $75,000, had wealth in 1989 of
$200,000 (also close to the sample average), and does not have a profit-sharing plan.
Then, for choice = 1, $\hat E(pctstck \mid x) \approx 40.0$, and with choice = 0, $\hat E(pctstck \mid x) \approx
28.1$. The difference, 11.9, is remarkably close to the linear model estimate of the
effect of choice.
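
The expected-value calculation in this example applies equation (16.25) to the ordered
probit probabilities. The sketch below uses the reported cut points and the coefficient on
choice, but the contribution of the other regressors to the index is not reported in the
text; the value used for it here (INDEX_NO_CHOICE) is an assumption, backed out so that the
choice = 0 expectation is close to the reported 28.1. It should be read as an illustration
of the mechanics, not a replication.

```python
import math

def Phi(z):
    """Standard normal cdf."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

ALPHA_1, ALPHA_2 = -3.087, -2.054   # estimated cut points reported in the example
B_CHOICE = 0.371                    # ordered probit coefficient on choice
INDEX_NO_CHOICE = -3.232            # ASSUMED index from the other regressors (not reported)

def expected_pctstck(index):
    """E(y | x) with y coded 0 (mostly bonds), 50 (mixed), 100 (mostly stocks)."""
    p_bonds = Phi(ALPHA_1 - index)
    p_mixed = Phi(ALPHA_2 - index) - Phi(ALPHA_1 - index)
    p_stocks = 1.0 - Phi(ALPHA_2 - index)
    return 0.0 * p_bonds + 50.0 * p_mixed + 100.0 * p_stocks

e_no_choice = expected_pctstck(INDEX_NO_CHOICE)
e_choice = expected_pctstck(INDEX_NO_CHOICE + B_CHOICE)
print(round(e_no_choice, 1), round(e_choice, 1), round(e_choice - e_no_choice, 1))
```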
For ordered probit, the percentages correctly predicted for each category are 51.6
(mostly bonds), 43.1 (mixed), and 37.9 (mostly stocks). The overall percentage cor- 
rectly predicted is about 44.3. 


16.3.2 Specification Issues in Ordered Models 


As with binary probit and logit in Chapter 15, as well as the unordered models
discussed in Section 16.2, we can investigate the consequences of various specification
problems with ordered probit and logit. Some writers have focused on the parallel
regression assumption, which arises because of the underlying latent variable
formulation with a single unobservable independent of the covariates. In particular, if we
write $y^*$ as in equation (16.21), where $D(e \mid x)$ is standard normal or logistic, then

$$ P(y \le j \mid x) = P(y^* \le \alpha_{j+1} \mid x) = G(\alpha_{j+1} - x\beta), \qquad j = 0, 1, \ldots, J - 1, \tag{16.26} $$

where $G(\cdot) = \Phi(\cdot)$ or $G(\cdot) = \Lambda(\cdot)$, and so the probabilities differ across j only
because of the cut parameters. In effect, an intercept shift inside the nonlinear cdf
determines the differences in probabilities. A more general specification allows the
vector $\beta$ to change across j, too:

$$ P(y \le j \mid x) = G(\alpha_{j+1} - x\beta_j), \qquad j = 0, 1, \ldots, J - 1. \tag{16.27} $$

Clearly, equation (16.27) is more general than (16.26), and the pair $(\alpha_{j+1}, \beta_j)$ is easily
estimated by applying binary response MLE to the response variable $w_{ij} = 1[y_i \le j]$,
$j = 0, \ldots, J - 1$. As discussed in Long and Freese (2001), estimation of (16.27) and
estimation of the ordered response model (which is the restricted model) can be used
to obtain the LR statistic for the hypothesis that the $\beta_j$ are all equal.

Testing that the $\beta_j$ are the same in equation (16.27) is certainly a valid specification
test of the standard ordered probit or logit model, but it is not clear how to
proceed if we reject a common $\beta$. Presumably, logic dictates whether $y$ is an ordered
response or an unordered response; if $y$ is an ordered response, the possibility that the
estimated probabilities $P(y \le j \mid x)$ are not increasing in $j$ for all values of $x$ (which
can happen if the $\beta_j$ are allowed to differ) does not make sense, and it makes little
sense to estimate an unordered model (such as multinomial logit). Furthermore, a
statistical rejection need not imply that ordered probit or ordered logit estimates of
the response probabilities are poor estimates of the true response probabilities. If we
specify the more general model $P(y \le j \mid x) = G(\alpha_j - x\beta_j)$, then we are left with a
bunch of unconnected binary response models for $w_j = 1[y \le j]$, $j = 0, \ldots, J - 1$,
and it is not clear what we would learn in the end. (Estimating separate binary
response models for the $w_j$ is how one carries out an LR test of the parallel regression
assumption, where the restricted model is just the usual ordered probit or ordered logit;
see Long and Freese (2001) for further details.)
Equation (16.26) is useful in its own right. For a continuous variable $x_h$,


$$\frac{\partial P(y \le j \mid x)}{\partial x_h} = -\beta_h\, g(\alpha_j - x\beta), \tag{16.28}$$


where $g(\cdot)$ is the density associated with $G(\cdot)$, which means that the signs of the
partial effects on $P(y \le j \mid x)$ are unambiguously determined by the signs of the
coefficients. If $\beta_h > 0$, an increase in $x_h$ decreases the probability that $y$ is less than or
equal to any value $j$.
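
The sign result in equation (16.28) can be checked numerically. The following Python sketch, which uses only the standard library, evaluates $P(y \le j \mid x)$ for the ordered probit case and compares the analytical partial effect with a finite-difference derivative. The function names and the numerical values of the cut point, coefficients, and covariates are illustrative assumptions, not estimates from the text.

```python
import math

def norm_cdf(z):
    """Standard normal cdf, Phi(z)."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def norm_pdf(z):
    """Standard normal density, phi(z)."""
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)

def cum_prob(alpha_j, x, beta):
    """P(y <= j | x) = Phi(alpha_j - x*beta), as in equation (16.26)."""
    xb = sum(xk * bk for xk, bk in zip(x, beta))
    return norm_cdf(alpha_j - xb)

def partial_effect(alpha_j, x, beta, h):
    """Analytical partial effect of x_h on P(y <= j | x), equation (16.28)."""
    xb = sum(xk * bk for xk, bk in zip(x, beta))
    return -beta[h] * norm_pdf(alpha_j - xb)

# Hypothetical values chosen only for illustration.
alpha_j = 0.4
beta = [0.8, -0.3]
x = [1.2, 0.5]

analytic = partial_effect(alpha_j, x, beta, h=0)

# Central finite difference in x_0 as a check on the analytical formula.
eps = 1e-6
x_up = [x[0] + eps, x[1]]
x_dn = [x[0] - eps, x[1]]
numeric = (cum_prob(alpha_j, x_up, beta) - cum_prob(alpha_j, x_dn, beta)) / (2 * eps)

print(analytic, numeric)
```

Because the first coefficient is positive in this illustration, the partial effect on $P(y \le j \mid x)$ is negative, and it would be negative for any cut point, as the text states.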

Other specification issues follow our treatment of binary response models in
Chapter 15. For reasons discussed in Section 15.7, these are most easily studied in the
context of ordered probit. For example, if we add a normally distributed unobserved
heterogeneity term that is independent of $x$, we consistently estimate the APEs on the
response probabilities, and expected value, if we simply ignore the heterogeneity. Just
as in the binary response case, we estimate precisely the quantities of interest when we
just ignore the heterogeneity. More generally, if we specify $P(y_i = j \mid x_i, c_i)$ for
heterogeneity $c_i$ independent of $x_i$, the APEs are obtained by computing partial effects
on $P(y_i = j \mid x_i)$. Therefore, one can argue that we might as well just specify more
flexible models for $P(y_i = j \mid x_i)$.

One way to specify more flexible models is to introduce heteroskedasticity or
nonnormality in the latent variable equation, just as in the binary response case. So,
assume in equation (16.21) that


$$e \mid x \sim \mathrm{Normal}(0, \exp(2x_1\delta)), \tag{16.29}$$


where $x_1$ is a subset of $x$ (possibly $x_1 = x$). The response probabilities are obtained
by simply multiplying $(\alpha_j - x\beta)$ everywhere in equations (16.23) by $\exp(-x_1\delta)$. A
score test of $H_0\colon \delta = 0$, or a variable addition version of the test, is straightforward
to derive; see Problem 16.4. MLE of the unrestricted model is not too difficult, either.
But, as in the binary response case, we must remember to compare estimated partial
effects across different models, rather than just coefficient estimates. The same issues
that arise in the binary case for how to define the APEs arise here, too; see Section
15.7.4.
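
As a rough illustration of how the heteroskedastic response probabilities are formed, the sketch below rescales each cut-point index by $\exp(-x_1\delta)$ before applying the standard normal cdf. It assumes cut points $\alpha_1 < \cdots < \alpha_J$ with $y = 0$ when the latent variable is at or below $\alpha_1$; the function name and all numerical values are hypothetical.

```python
import math

def norm_cdf(z):
    """Standard normal cdf, Phi(z)."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def het_oprobit_probs(alphas, xb, x1d):
    """Return [P(y=0|x), ..., P(y=J|x)] when e|x ~ Normal(0, exp(2*x1d)).

    alphas: increasing cut points alpha_1 < ... < alpha_J
    xb:     the index x*beta
    x1d:    the index x_1*delta appearing in the variance function
    """
    scale = math.exp(-x1d)
    # Cumulative probabilities at each cut point, then 1 for the top category.
    cum = [norm_cdf(scale * (a - xb)) for a in alphas] + [1.0]
    return [cum[0]] + [cum[j] - cum[j - 1] for j in range(1, len(cum))]

# Hypothetical cut points and indexes, for illustration only.
alphas = [-0.5, 0.7]
homo = het_oprobit_probs(alphas, xb=0.2, x1d=0.0)   # delta = 0: standard ordered probit
hetero = het_oprobit_probs(alphas, xb=0.2, x1d=0.6)
print(homo, hetero)
```

With $\delta = 0$ the function reproduces the usual ordered probit probabilities, which is the restricted model under $H_0$. Comparing `homo` and `hetero` at given covariate values is one way to see why partial effects, not coefficients, should be compared across the two models.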

As in the binary response case, it is possible to relax the normality or logistic
assumption on the latent error, $e$. Again, the key issue in considering such extensions is
whether implementing them changes the estimated partial effects in important ways.
As usual, it is neither necessary nor sufficient to consider only the effects on the estimates
of $\beta$ and the $\alpha_j$.


16.3.3 Endogenous Explanatory Variables


Handling one or more continuous endogenous explanatory variables is relatively
straightforward in ordered probit models, provided we are willing to make
distributional assumptions on the reduced form. In fact, the Rivers and Vuong (1988)
approach (see Section 15.7.2) extends immediately to ordered probit. Write the model
now as


$$y_1^* = z_1\delta_1 + \gamma_1 y_2 + u_1 \tag{16.30}$$
$$y_2 = z\delta_2 + v_2, \tag{16.31}$$


where $(u_1, v_2)$ is independent of $z$ and jointly normally distributed. (As in the binary
case, we can relax these assumptions a bit.) In keeping with the typical ordered probit
approach, $z_1$ does not contain an intercept. Instead, there are cut points, $\alpha_j$, $j =
1, \ldots, J$. We define the observed ordered response, $y_1$, in terms of the latent response,
$y_1^*$, just as in equations (16.22).

By now the approach should be clear. If we write $u_1 = \theta_1 v_2 + e_1$ and plug into
(16.30) we obtain


$$y_1^* = z_1\delta_1 + \gamma_1 y_2 + \theta_1 v_2 + e_1, \tag{16.32}$$


where $\theta_1 = \eta_1/\tau_2^2$, $\eta_1 = \mathrm{Cov}(v_2, u_1)$, $\tau_2^2 = \mathrm{Var}(v_2)$, $e_1 \mid z, v_2 \sim \mathrm{Normal}(0, 1 - \rho_1^2)$, and
$\rho_1^2 = \theta_1^2\tau_2^2 = \eta_1^2/\tau_2^2$. It follows from the standard results on two-step estimation that if
we obtain the OLS residuals, $\hat{v}_{i2}$, from the first-stage regression of $y_{i2}$ on $z_i$, $i = 1, \ldots,
N$, and then run ordered probit of $y_{i1}$ on $z_{i1}$, $y_{i2}$, and $\hat{v}_{i2}$ in a second stage, we
consistently estimate the scaled coefficients $\delta_{\rho 1} = \delta_1/(1 - \rho_1^2)^{1/2}$, $\gamma_{\rho 1} = \gamma_1/(1 - \rho_1^2)^{1/2}$,
$\theta_{\rho 1} = \theta_1/(1 - \rho_1^2)^{1/2}$, and $\alpha_{\rho j} = \alpha_j/(1 - \rho_1^2)^{1/2}$. A simple test of the null hypothesis
that $y_2$ is exogenous (where we maintain, of course, that $z$ is exogenous) is just the
standard $t$ statistic on $\hat{v}_{i2}$. As in Section 15.7.2, we can estimate the original
parameters by dividing each of the scaled coefficients by $(1 + \hat{\theta}_{\rho 1}^2\hat{\tau}_2^2)^{1/2}$. Bootstrapping is a
natural way to obtain standard errors; the delta method can also be used.
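
The rescaling step can be verified with a few lines of arithmetic. The sketch below starts from hypothetical population values, forms the scaled coefficients that the second-stage ordered probit would estimate in the limit, and then recovers the original parameters by dividing by $(1 + \theta_{\rho 1}^2\tau_2^2)^{1/2}$. The function name `unscale` and every number here are assumptions made for illustration; in practice the inputs would be the second-stage estimates and the first-stage residual variance.

```python
import math

def unscale(scaled_coefs, theta_rho1, tau2_sq):
    """Divide each scaled coefficient by (1 + theta_rho1^2 * tau2^2)^(1/2)."""
    factor = math.sqrt(1.0 + theta_rho1 ** 2 * tau2_sq)
    return [c / factor for c in scaled_coefs]

# Hypothetical population values, used only to check the algebra.
delta1, gamma1 = 0.7, -1.1
eta1, tau2_sq = 0.6, 2.0           # Cov(v2, u1) and Var(v2); Var(u1) = 1
theta1 = eta1 / tau2_sq
rho1_sq = eta1 ** 2 / tau2_sq      # squared correlation between u1 and v2

# Probability limits of the second-stage ordered probit coefficients.
s = math.sqrt(1.0 - rho1_sq)
delta_rho1, gamma_rho1, theta_rho1 = delta1 / s, gamma1 / s, theta1 / s

recovered = unscale([delta_rho1, gamma_rho1, theta_rho1], theta_rho1, tau2_sq)
print(recovered)
```

The check works because $1 + \theta_{\rho 1}^2\tau_2^2 = 1/(1 - \rho_1^2)$, so dividing by the square root of the left-hand side undoes the original scaling.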

Alternatively, as in the binary case, we can compute the response probabilities (or
expected values) for the second-step ordered probit that includes the residuals, $\hat{v}_{i2}$.
Then, the response probabilities, or their derivatives, can be averaged out over $\hat{v}_{i2}$ to
obtain consistent estimators of the APEs.

Naturally, allowing endogenous explanatory variables that do not have a conditional,
homoskedastic normal distribution is more difficult. One can replace equation
(16.31) with $y_2 = 1[z\delta_2 + v_2 > 0]$, where $v_2$ has a standard normal distribution, and
then use MLE. That requires obtaining $P(y_1 = j \mid z, y_2)$, $j = 0, \ldots, J - 1$, for $y_2 = 0$
and $y_2 = 1$, just as in the binary case. The estimation problem is not particularly
difficult; see, for example, Adams, Chiang, and Jensen (2003). As discussed in Section
15.7.3, a two-stage approach that replaces $y_2$ with fitted probabilities from the
first-stage probit is not justified; it produces inconsistent estimators of the parameters and
(probably) the APEs.

A simpler but more radical solution is to assume that a single (estimable) function
of $(z, y_2)$ is correlated with the unobservables in the structural ordered probit model.
This assumption is implicit in all of the control function approaches we have
implemented. For example, in the case of equation (16.31), $v_2$ plays the role of the single,
estimable function. To be precise, suppose we explicitly introduce unobservables, $r_1$,
thought to be correlated with $y_2$. Then, we are interested in the response probabilities


$$P(y_1 = j \mid z, y_2, r_1) = P(y_1 = j \mid z_1, y_2, r_1), \tag{16.33}$$


where the equality simply implies an exclusion restriction. Because $r_1$ is not observed,
we will integrate $r_1$ out of the response probabilities when computing partial effects.
Define the standardized residual for $y_2$ as


$$e_2 \equiv [y_2 - \Phi(z\delta_2)]/\{\Phi(z\delta_2)[1 - \Phi(z\delta_2)]\}^{1/2}, \tag{16.34}$$


under the assumption that $D(y_2 \mid z)$ follows a probit model. By construction, $E(e_2 \mid z)
= 0$ and $\mathrm{Var}(e_2 \mid z) = 1$. Unlike $v_2$ in equation (16.31), $e_2$ cannot be independent of $z$
because its support depends directly on $z$. Nevertheless, suppose we simply assert that


$$D(r_1 \mid z, y_2) = D(r_1 \mid e_2). \tag{16.35}$$


Under assumption (16.35), it follows that we can consistently estimate the APEs of
$(z_1, y_2)$ on $P(y_1 = j \mid z_1, y_2, r_1)$ by estimating $P(y_1 = j \mid z_1, y_2, e_2)$ and then averaging
out $e_2$. In the language of Blundell and Powell (2004), the ASF for the response
probabilities is


$$\mathrm{ASF}_j(z_1, y_2) = E_{e_2}[p_j(z_1, y_2, e_2)], \tag{16.36}$$


where $p_j(z_1, y_2, e_2) \equiv P(y_1 = j \mid z_1, y_2, e_2)$. Now, the approach to estimating APEs is,
in principle, straightforward. In the first stage, we estimate a probit of $y_{i2}$ on $z_i$ and
construct the standardized residuals, $\hat{e}_{i2} = (y_{i2} - \hat{\Phi}_{i2})/[\hat{\Phi}_{i2}(1 - \hat{\Phi}_{i2})]^{1/2}$. Next, we
estimate a model for the $p_j(z_1, y_2, e_2)$ by inserting $\hat{e}_{i2}$ for the unobserved $e_{i2}$. Because $y_1$
is an ordered outcome, any estimation approach should recognize the ordered nature.
Because ordered probit (and ordered logit) are straightforward, we might, as an
approximation, use an ordered probit with $\hat{e}_{i2}$ entered in a flexible way, for example,
polynomials, and possibly interacted with $(z_{i1}, y_{i2})$. The parameter estimates from
such an ordered probit would not necessarily mean much, but the APEs would be
easy to estimate by averaging out the $\hat{e}_{i2}$ in the estimated response probabilities. At
the very least, just adding $\hat{e}_{i2}$ as a single explanatory variable to the ordered probit and
conducting a $t$ test on $\hat{e}_{i2}$ is valid as an endogeneity test; as usual, the null hypothesis
is that $y_2$ is exogenous. (Of course, putting in the residuals from an LPM, or the
unstandardized probit residuals, also provides a valid test.) Naturally, because the
ordered probit contains binary probit as a special case, the method can be applied to
the standard binary probit model.
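
The properties of the standardized residual in (16.34) are easy to confirm for a given value of the probit index. The following sketch computes the two possible values of the residual (one for each outcome of the binary variable) and its exact conditional mean and variance. The function names and the index values are hypothetical; in an application the index would be the fitted first-stage probit index.

```python
import math

def norm_cdf(z):
    """Standard normal cdf, Phi(z)."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def std_probit_resid(y2, index):
    """Standardized residual in (16.34); index plays the role of z*delta_2."""
    p = norm_cdf(index)
    return (y2 - p) / math.sqrt(p * (1.0 - p))

def moments_given_z(index):
    """Exact conditional mean and variance of the residual under the probit model."""
    p = norm_cdf(index)
    e_one = std_probit_resid(1, index)
    e_zero = std_probit_resid(0, index)
    mean = p * e_one + (1.0 - p) * e_zero
    var = p * e_one ** 2 + (1.0 - p) * e_zero ** 2 - mean ** 2
    return mean, var

# The two support points change with the index even though the first
# two conditional moments do not.
for idx in (-1.0, 0.0, 0.8):
    print(idx, std_probit_resid(0, idx), std_probit_resid(1, idx), moments_given_z(idx))
```

The output shows a conditional mean of zero and a conditional variance of one at every index value, while the support points move with the index. That is the sense in which the residual cannot be independent of the instruments.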

The key drawback to the computationally simple procedure just described is that
there is little basis for assumption (16.35) when $y_2$ is binary. Nevertheless, it could
hold, and then the only issue is choosing functional forms. One could use
nonparametric or semiparametric methods in each step, similar to Blundell and Powell
(2004), to overcome objections caused by specific functional forms (such as ordered
probit in the second stage).


16.3.4 Panel Data Methods 


We can easily adapt the correlated random effects (CRE) methods for binary probit
(see Section 15.8.2) to ordered probit. We start with the standard latent variable
model


$$y_{it}^* = x_{it}\beta + c_i + e_{it}, \quad t = 1, \ldots, T, \tag{16.37}$$

where

$$e_{it} \mid x_i, c_i \sim \mathrm{Normal}(0, 1), \quad t = 1, \ldots, T, \tag{16.38}$$


and $x_{it}$ can contain time-period dummies but not an overall intercept. Then, the
observed ordered response is $y_{it} = 0$ if $y_{it}^* \le \alpha_1$, $y_{it} = 1$ if $\alpha_1 < y_{it}^* \le \alpha_2$, and so on.


We can add the Chamberlain-Mundlak device:

$$c_i = \psi + \bar{x}_i\xi + a_i, \quad a_i \mid x_i \sim \mathrm{Normal}(0, \sigma_a^2). \tag{16.39}$$


Now, if we compute the response probabilities $p_j(x_{it}, \bar{x}_i) \equiv P(y_{it} = j \mid x_{it}, \bar{x}_i) =
P(y_{it} = j \mid x_i)$, it is easily seen that these have the ordered probit form with
parameters $\beta_a$, $\xi_a$, and $\alpha_{aj}$, $j = 1, \ldots, J$, where the $a$ subscript denotes division by $(1 + \sigma_a^2)^{1/2}$
and $\psi$ is absorbed into the cut points. Estimation of the scaled parameters proceeds
by pooled ordered probit, where we allow for unrestricted serial dependence.
Naturally, the APEs of $P(y_{it} = j \mid x_{it} = x_t, c_i = c)$ with respect to elements of $x_t$ are
obtained from


$$N^{-1}\sum_{i=1}^{N} p_j(x_t, \bar{x}_i, \hat{\theta}_a), \tag{16.40}$$

where $\hat{\theta}_a$ represents the vector of all scaled parameter estimates, the only ones we can
obtain without further assumption.
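
A minimal sketch of the averaging in (16.40) follows. It assumes the scaled parameters have already been estimated by pooled ordered probit and simply plugs in hypothetical values; the function names, the parameter values, and the small list of time averages are all made up for illustration. An APE for a continuous covariate is then approximated by a finite difference of the averaged response probability.

```python
import math

def norm_cdf(z):
    """Standard normal cdf, Phi(z)."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def response_prob(j, x_t, xbar_i, beta_a, xi_a, alphas_a):
    """P(y = j | x_t, xbar_i) in ordered probit form with scaled parameters.

    alphas_a holds the J scaled cut points; outcomes are j = 0, ..., J.
    """
    index = sum(a * b for a, b in zip(x_t, beta_a))
    index += sum(a * b for a, b in zip(xbar_i, xi_a))
    J = len(alphas_a)
    upper = 1.0 if j == J else norm_cdf(alphas_a[j] - index)
    lower = 0.0 if j == 0 else norm_cdf(alphas_a[j - 1] - index)
    return upper - lower

def average_response_prob(j, x_t, xbars, beta_a, xi_a, alphas_a):
    """Sample average over i of p_j(x_t, xbar_i, theta_a), as in (16.40)."""
    total = sum(response_prob(j, x_t, xb, beta_a, xi_a, alphas_a) for xb in xbars)
    return total / len(xbars)

# Hypothetical scaled estimates and time averages for N = 4 units.
beta_a, xi_a, alphas_a = [0.5], [0.3], [-0.4, 0.9]
xbars = [[0.1], [0.8], [-0.5], [1.4]]
x_t = [0.6]

avg_probs = [average_response_prob(j, x_t, xbars, beta_a, xi_a, alphas_a) for j in range(3)]

# Finite-difference APE of the single covariate on each averaged probability.
h = 1e-5
apes = [
    (average_response_prob(j, [x_t[0] + h], xbars, beta_a, xi_a, alphas_a)
     - average_response_prob(j, [x_t[0] - h], xbars, beta_a, xi_a, alphas_a)) / (2 * h)
    for j in range(3)
]
print(avg_probs, apes)
```

Note that the time averages are held at their observed values while the covariate of interest is varied, which is what averaging across the cross section in (16.40) amounts to.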

A full MLE approach is possible if we assume that the $e_{it}$ are independent
conditional on $(x_i, c_i)$, but, computationally, full MLE is more difficult than the pooled
(partial) MLE. As always, the pooled method identifies only the APEs, but it is more
robust than the full MLE.

Wooldridge (2005b) provides a framework to estimate dynamic ordered probit
models. An important issue concerns how the dynamics should enter the response
probabilities. Wooldridge proposes including a set of dummies indicating the
previous period's outcome. Namely, define $J$ dummies, say $w_{i,t-1,1}, \ldots, w_{i,t-1,J}$, where
$w_{i,t-1,j} = 1[y_{i,t-1} = j]$, and then $w_{i,t-1} = (w_{i,t-1,1}, \ldots, w_{i,t-1,J})$ is included among the
explanatory variables. So, the latent variable model is


$$y_{it}^* = z_{it}\delta + w_{i,t-1}\rho + c_i + u_{it}, \quad t = 1, \ldots, T. \tag{16.41}$$


To account for the initial conditions problem, the unobserved effect, $c_i$, is modeled as
$c_i = \psi + w_{i0}\eta + z_i\xi + a_i$, where $w_{i0}$ is the $J$-vector of initial conditions, $w_{i0j}$, and $z_i$ is
the entire history of the strictly exogenous explanatory variables $z_{it}$. If $a_i$ is
independent of $(w_{i0}, z_i)$ and distributed as $\mathrm{Normal}(0, \sigma_a^2)$, we can apply random effects
ordered probit to the latent variable equation


$$y_{it}^* = z_{it}\delta + w_{i,t-1}\rho + w_{i0}\eta + z_i\xi + a_i + u_{it}, \quad t = 1, \ldots, T, \tag{16.42}$$


where we absorb the intercept into the cut parameters, $\alpha_j$. Any software that
estimates RE ordered probit models can be applied directly to estimate all
parameters, including $\sigma_a^2$; we simply specify the explanatory variables at time $t$ as
$(z_{it}, w_{i,t-1}, w_{i0}, z_i)$. (Pooled ordered probit does not consistently estimate any
interesting parameters; see the discussion in Section 15.8.4 for the binary probit case.)
APEs are easily computed, as discussed in Wooldridge (2005b). Not surprisingly,
the APEs depend on the coefficients multiplied by $(1 + \hat{\sigma}_a^2)^{-1/2}$.


Problems 


16.1. Use the data in KEANE.RAW to answer this question. 


a. Estimate the model reported in Table 16.1, using the data for 1981. Do any of the 
coefficients differ in important ways from those in Table 16.1 (for 1987)? 


b. Estimate the model pooled across all years, and include year dummies for 1982 to 
1987. Explain why, in general, the standard errors and test statistics should be made 
robust to arbitrary serial dependence. Do the usual and robust standard errors differ
substantially? 


c. Should the year dummies be kept in the model? Explain. 


d. Using the model estimated in part b, estimate the change in the probability of 
being employed for a black man with five years of experience when educ increases 
from 12 to 16. Obtain the estimates for both 1981 and 1987, and comment. 


e. How would you test whether the coefficients on $exper$ and $exper^2$ have changed
over time?


16.2. Use the data in PENSION.RAW for this exercise. 


a. Estimate a linear model for pctstck, where the explanatory variables are choice, 
age, educ, female, black, married, finc25, ..., finc101, wealth89, and prftshr. Why
might you compute heteroskedasticity-robust standard errors? 


b. The sample contains separate observations for some husband-wife pairs. Compute 
standard errors of the estimates from the model in part a that account for the cluster 
correlation within family. (These should also be heteroskedasticity-robust.) Do 
the standard errors differ much from the usual OLS standard errors, or from the 
heteroskedasticity-robust standard errors? 

c. Estimate the model from part a by ordered probit. Estimate E(pctstck |x) for a 
single, nonblack female with 12 years of education who is 60 years old. Assume she 
has net worth (in 1989) equal to $150,000 and earns $45,000 a year, and her plan is 
not profit sharing. Compare this with the estimate of E(pctstck |x) from the linear 
model. 


d. If you want to choose between the linear model and ordered probit based on how 
well each estimates E(y |x), how would you proceed? 

16.3. Consider the ordered probit model under exponential heteroskedasticity, as in 
(16.29). 

a. Derive the response probabilities P(y = j |x). 


b. Write down the log likelihood as a function of all parameters, $\alpha$, $\beta$, and $\delta$. Find the
gradient of the log likelihood with respect to $\delta$, and evaluate the gradient at $\delta = 0$.


c. What might be a useful variable addition test for $H_0\colon \delta = 0$?

d. For the outcome $y = 1$ (just for concreteness), define the average structural
function as


$$\mathrm{ASF}_1(x) = E_{e_i}(1[\alpha_1 - x\beta < e_i \le \alpha_2 - x\beta]).$$

Find the ASF in terms of the (unconditional) cdf of $e_i$ (and $x$ and the parameters). Is
the cdf of $e_i$ known?


e. Use iterated expectations to show that


$$\mathrm{ASF}_1(x) = E_{x_{i1}}\{\Phi[\exp(-x_{i1}\delta)(\alpha_2 - x\beta)] - \Phi[\exp(-x_{i1}\delta)(\alpha_1 - x\beta)]\}.$$


If you have estimated the parameters by MLE, how would you estimate $\mathrm{ASF}_1(x)$,
and how would you use it to estimate APEs?


16.4. Using the data in PENSION.RAW, define a variable invest = 0 if pctstck = 0, 
invest = 1 if pctstck = 50, and invest = 2 if pctstck = 100. 


a. Estimate the ordered probit model in Example 16.2 but with invest as the depen- 
dent variable. What do you conclude? 


b. Are there any interesting quantities that would differ between using pctstck and 
invest as the dependent variables? 


16.5. Consider the ordered probit model in (16.30) and (16.31), but assume that
$v_2 \mid z \sim \mathrm{Normal}(0, \exp(z\xi_2))$. In other words, $v_2$ contains heteroskedasticity. Assume
that $(u_1, e_2)$ is jointly normal and independent of $z$, where $e_2 \equiv \exp(-z\xi_2/2)v_2$ is the
standardized error; in particular, both $u_1$ and $e_2$ are independent of $z$ with a standard
normal distribution.

a. Propose $\sqrt{N}$-consistent estimators of $\delta_2$ and $\xi_2$.

b. Show that $u_1 = \theta_1 e_2 + e_1$, where $e_1$ is independent of $(z, e_2)$. Why is $e_1$ also
independent of $y_2$?

c. Propose a two-step method for consistently estimating the parameters in equation
(16.30) and the cut points up to scale.

d. How would you consistently estimate the APEs?

e. If equation (16.30) holds but $y_2 > 0$, and (16.31) holds with $\log(y_2)$ in place of $y_2$,
how would you proceed?

f. Now suppose that $E(y_2 \mid z) = \exp(z\delta_2)$ and $\mathrm{Var}(y_2 \mid z) = \exp(z\xi_2)$, and assume
that $e_2 \equiv [y_2 - E(y_2 \mid z)]/[\mathrm{Var}(y_2 \mid z)]^{1/2}$ is independent of $z$. How would you allow
for endogeneity of $y_2$ in an ordered probit model?


16.6. Write a panel data unobserved effects ordered probit model, with a potentially
endogenous explanatory variable, in latent variable form as


$$y_{it1}^* = z_{it1}\delta_1 + \gamma_1 y_{it2} + c_{i1} + u_{it1}, \quad u_{it1} \mid z_i, c_{i1} \sim \mathrm{Normal}(0, 1),$$

where we assume the observed outcome, $y_{it1}$, is defined to take on values in
$\{0, 1, \ldots, J\}$ with cut points $\alpha_1, \ldots, \alpha_J$. The notation extends in a natural way that of
Section 15.8.5. Assume, in addition, that


$$c_{i1} = \psi_1 + \bar{z}_i\xi_1 + a_{i1}, \quad a_{i1} \mid z_i \sim \mathrm{Normal}(0, \sigma_{a_1}^2)$$
$$y_{it2} = z_{it}\delta_2 + \psi_2 + \bar{z}_i\xi_2 + v_{it2}, \quad v_{it2} \mid z_i \sim \mathrm{Normal}(0, \tau_2^2), \quad t = 1, \ldots, T.$$


a. Propose a two-step control function approach to estimate $\delta_1$, $\gamma_1$, and the cut
parameters $\alpha_j$ up to scale. Put a $g$ index on the scaled parameters to distinguish them
from the original parameters.


b. Show that the average structural function for outcome $1 \le j \le J - 1$ can be
written as


$$\mathrm{ASF}_j(z_{t1}, y_{t2}) = E_{(\bar{z}_i, v_{it2})}[\Phi(\alpha_{gj} - z_{t1}\delta_{g1} - \gamma_{g1}y_{t2} - \bar{z}_i\xi_{g1} - \theta_{g1}v_{it2})
- \Phi(\alpha_{g,j-1} - z_{t1}\delta_{g1} - \gamma_{g1}y_{t2} - \bar{z}_i\xi_{g1} - \theta_{g1}v_{it2})],$$

where $E_{(\bar{z}_i, v_{it2})}(\cdot)$ denotes the expected value with respect to the distribution of
$(\bar{z}_i, v_{it2})$.

c. How would you estimate the APEs and obtain valid standard errors?


d. Are the estimators of the parameters and APEs consistent if $v_{it2}$ is correlated with
$u_{ir1}$ for some $r \ne t$? Explain.


17 Corner Solution Responses


17.1 Motivation and Examples 


We now turn to models for limited dependent variables that have features of both 
continuous and discrete random variables. In particular, they are continuously dis- 
tributed over a range of values—sometimes a very wide range—but they take on one 
or two focal points with positive probability. Such variables arise often in modeling 
individual, family, or firm behavior, and even when studying outcomes at a more 
aggregated level, such as the classroom or school level. 

The most common case is when the nonnegative response variable, y, has a 
(roughly) continuous distribution over strictly positive values, but P(y = 0) > 0. We 
call such a variable a corner solution response or corner solution outcome, where the 
corner in this case is at zero. Corners can occur at other values, too. For example, 
consider the population of families making charitable contributions during a given 
year. If y is the fraction of charitable contributions made to religious organizations, 
we are likely to see a wide range of values between zero and one, and then pileups at 
the two endpoints of zero and one. If so, the corners are at zero and one, and it 
makes sense to treat y as having a continuous distribution over the open interval 
(0,1). 

Corner solution responses are often called “censored responses,” a label that 
comes from situations with actual data censoring. Consequently, the leading model 
that we cover in this chapter is sometimes called a “censored regression model.” In- 
stead, we use the somewhat unconventional name corner solution model because we 
are trying to capture features of an observed corner solution response. The word 
“censored” implies that we are not observing the entire possible range of the response 
variable, but that is not the case for corner solution responses. For example, in a 
model of charitable contributions, the variable we are interested in explaining, both 
for theoretical reasons and for the purposes of policy analysis, is the actual amount of 
charitable contributions. That this outcome might be zero for a nontrivial fraction of 
the population does not mean that charitable contributions are somehow “censored 
at zero,” a common but misleading phrase that one sees used in the analysis of corner 
solution responses. The fact that, say, charitable contributions, labor supply, life in- 
surance purchases, and fraction of investments in the stock market pile up at certain 
focal points means that we might want to use special econometric models. But it is 
not a problem of data observability. 

In Chapter 19, we will study true data-censoring problems, where the underlying 
variable we would like to explain is censored above or below a threshold. Typically, 
data censoring arises because of a survey sampling scheme or institutional con- 
straints. There, we will be interested in an underlying response variable that we do 
not fully observe because it is censored above or below certain values. The
econometric models for corner solution responses and censored data have similar statistical
structures, but the ways one uses the estimates and thinks about violations of under- 
lying assumptions are different. To avoid confusion, and to do justice to both corner 
solution models and data-censoring mechanisms, we treat data censoring in a sepa- 
rate chapter on data problems. 

Before we consider models specifically developed for corner solution responses, it is
important to understand that some simple strategies are available, and also to
understand the shortcomings of these strategies. Because we are interested in features
of $D(y \mid x)$ for observable $y$, we can model such features directly. If we are interested
in the effect of $x_j$ on the mean response $E(y \mid x)$, it is natural to ask: Why not just
assume $E(y \mid x) = x\beta$ (where $x_1 = 1$) and apply OLS on a random sample? Of course,
if $E(y \mid x) = x\beta$, then OLS of $y_i$ on $x_i$, $i = 1, \ldots, N$, is perfectly sensible in that it
consistently estimates $\beta$. Consistency holds even though $y_i \ge 0$ and $P(y_i = 0) > 0$;
nothing about consistency of OLS hinges on restricting the probabilistic features of $y$.
The problem with estimating a linear model is the assumption of a mean linear in $x$:
unless the range of $x$ is fairly limited, $E(y \mid x)$ cannot truly be linear in $x$. (In the
special case where $x$ consists of exhaustive and mutually exclusive dummy variables,
$E(y \mid x)$ can always be written as a linear function of $x$.) A related problem is that the
partial effects on $E(y \mid x)$ cannot really be constant over a wide range of $x$, and using
standard nonlinear transformations of the underlying explanatory variables cannot
fully solve the problem. These shortcomings with a linear model for $E(y \mid x)$ are quite
analogous to those for the linear probability model.

As we discussed in Section 15.2 for binary $y$, it is always valid to view the linear
model as the linear projection $L(y \mid x)$. As we know, regardless of the nature of $y$
(and $x$), $L(y \mid x)$ is always well defined, provided all random variables have finite
second moments. Further, as we saw in Section 15.7.5, the coefficients in the linear
projection can, under restrictive assumptions, equal average partial effects (APEs).
Generally, the linear projection may well approximate the APEs. Even so, one may
be interested in getting sensible estimates of $E(y \mid x)$, along with partial effects on the
conditional mean, over a wide range of $x$ values, and the linear projection may
provide a poor approximation to the conditional mean, although the approximation is
better if $x$ includes flexible functions of the underlying covariates.

Even though a linear model for $E(y \mid x)$ usually is not suitable, we have seen other
relatively simple functional forms that ensure a positive conditional mean for all
values of $x$ and the parameters. The leading case is an exponential function,
$E(y \mid x) = \exp(x\beta)$, where again we assume that $x_1 = 1$. (We cannot use $\log(y)$ as the
dependent variable in a linear regression because $\log(0)$ is undefined.) It is important
to understand that there is nothing wrong with an exponential model for $E(y \mid x)$. It
has all of the features we want in a conditional mean model for a nonnegative
response. It is true that the exponential mean is not compatible with the Tobit model
that we define below. But the Tobit model, as attractive as it is, is just one possibility,
and there is nothing logically wrong with directly specifying an exponential model for
$E(y \mid x)$. As we will see in Chapter 18, an exponential function also lends itself to
relatively simple ways to account for endogenous explanatory variables.

Because Var(y|x) is likely to be heteroskedastic when y is a corner solution, 
nonlinear least squares (NLS) estimation of an exponential model is likely to be in- 
efficient. As we know from Chapter 12, we can use weighted NLS to obtain more 
efficient estimators, although that requires specification of a model for the condi- 
tional variance that would be arbitrary. However, remember that we can use fully 
robust inference whether we use NLS or WNLS, and a thoughtfully constructed 
WNLS estimator might be more efficient than NLS even if we have the conditional 
variance misspecified. 

A more important criticism of modeling $E(y \mid x)$ as an exponential function,
or any other function, is that we cannot measure the effect of the $x_j$ on any other
feature of $D(y \mid x)$. Often, we are interested in features such as $P(y = 0 \mid x)$ and
$E(y \mid x, y > 0)$. By construction, a model for $E(y \mid x)$ says nothing about other
features of $D(y \mid x)$. In this chapter, we are mainly concerned with models that fully
specify the conditional distribution, although we will touch on other situations where
we specify less than a full conditional distribution.

Before we turn to econometric models, we note that economic models of max- 
imizing or minimizing behavior often lead to the possibility of corner solution out- 
comes. A good example is annual hours worked for married women. In the 
population, we see a wide range of hours worked over strictly positive hours, with 
enough different values to take the distribution as being continuous. But we also see a 
nontrivial fraction of married women who do not work for a wage or salary. 

Generally, utility maximization problems allow for the possibility of a corner 
solution, as shown by the following simple model of charitable contributions. 


Example 17.1 (Charitable Contributions): Problem 15.2 shows how to derive a
probit model from a utility maximization problem for charitable giving, using utility
function $util_i(c, q) = c + a_i\log(1 + q)$, where $c$ is annual consumption in dollars and
$q$ is annual charitable giving. The variable $a_i$ determines the marginal utility of giving
for family $i$. Maximizing subject to the budget constraint $c_i + p_i q_i = m_i$ (where $m_i$ is
family income and $p_i$ is the price of a dollar of charitable contributions) and the
inequality constraints $c, q \ge 0$, the solution $q_i$ is easily shown to be $q_i = 0$ if $a_i/p_i \le 1$,
and $q_i = a_i/p_i - 1$ if $a_i/p_i > 1$. We can write this relation as $1 + q_i = \max(1, a_i/p_i)$.
If $a_i = \exp(z_i\gamma + u_i)$, where $u_i$ is an unobservable, then charitable contributions are
determined by the equation


$$\log(1 + q_i) = \max[0, z_i\gamma - \log(p_i) + u_i]. \tag{17.1}$$


Because $q_i = 0$ if and only if $\log(1 + q_i) = 0$, equation (17.1) implies that the
probability of observing zero charitable contributions is strictly positive. Further,
because $u_i$ has a normal distribution, $q_i$ has a continuous distribution over strictly
positive values.
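
The algebra behind the example can be checked directly. The sketch below codes the corner and interior solutions for giving and confirms that the log of one plus optimal giving equals the max expression in (17.1). It assumes, as the example implicitly does, that income is large enough that the constraint on consumption does not bind; the function names and the input values are hypothetical.

```python
import math

def optimal_giving(a, p):
    """Utility-maximizing q for util = c + a*log(1 + q) at price p.

    Assumes income is large enough that c >= 0 does not bind.
    """
    return max(0.0, a / p - 1.0)

def log_one_plus_q(z_gamma, p, u):
    """Right-hand side of (17.1): max[0, z*gamma - log(p) + u]."""
    return max(0.0, z_gamma - math.log(p) + u)

# Hypothetical (z*gamma, price, unobservable) triples, for illustration only.
cases = [(0.5, 0.7, 0.2), (0.1, 1.0, -0.6), (-0.3, 0.8, 0.05), (1.0, 0.9, 1.5)]
for z_gamma, p, u in cases:
    a = math.exp(z_gamma + u)           # marginal utility shifter a_i
    q = optimal_giving(a, p)
    print(q, math.log(1.0 + q), log_one_plus_q(z_gamma, p, u))
```

Cases with $a_i/p_i \le 1$ produce exactly zero giving, which is the corner; the remaining cases give strictly positive, continuously varying amounts.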

The charitable contributions example is a special case of a model that has become
a workhorse for corner solution responses when the only corner is at zero, the
canonical case. In the population, let $y$ be the corner solution response and let $x$ be
the row vector of covariates (which contains unity as its first element). Assume


$$y = \max(0, x\beta + u), \tag{17.2}$$


where $u$ is unobservable. Naturally, $u$ could have a variety of distributions, and its
conditional distribution $D(u \mid x)$ could depend on $x$. But we will mostly work with the
assumption


$$u \mid x \sim \mathrm{Normal}(0, \sigma^2), \tag{17.3}$$


which implies that $u$ is independent of $x$. Assumptions (17.2) and (17.3) define the
type I Tobit model (after Tobin, 1958). This model is also called the standard censored
regression model, but, as we mentioned before, we avoid the word "censored" in this
chapter because it connotes some sort of data censoring. Amemiya (1985) gave it the
"type I" label, and we use that here because it is neutral with respect to the nature of
the application. (Interestingly, Tobin's original application to spending on consumer
durables is clearly a corner solution application, and he never uses the word
"censored" in his article. Instead, he refers to the response taking on its "limit value."
Plus, Tobin was careful to compare the Tobit estimates of the conditional mean with
the linear model estimates.) It is handy to have a notation for when a variable follows
a Tobit model. If $D(y \mid x)$ is determined by (17.2) and (17.3), we write $D(y \mid x) =
\mathrm{Tobit}(x\beta, \sigma^2)$.

The normality assumption for $u$ means that it has unbounded support, and,
because $u$ is independent of $x$, there is always positive probability (for any $x$ and any
value of $\beta$) that $x\beta + u < 0$, which means $P(y = 0 \mid x) > 0$.

Equation (17.2) has the benefit of directly relating the variable of interest, y, to 
observed explanatory variables and an unobservable. Nevertheless, sometimes it is 
useful to write (17.2) as a latent variable model: 
$$y^* = x\beta + u, \quad u \mid x \sim \mathrm{Normal}(0, \sigma^2), \tag{17.4}$$
$$y = \max(0, y^*). \tag{17.5}$$


The latent variable formulation has the danger of suggesting that we are interested 
in E(y* |x), but it will be valuable for certain derivations later on. Given (17.4), the 
latent variable y* satisfies the classical linear model assumptions. 

We can write the model in Example 17.1 as in equation (17.2) by defining (in the
population) $y = \log(1 + q)$ and $x = [z, \log(p)]$. This particular transformation of $q$,
along with the restriction that the coefficient on $\log(p)$ is $-1$, are products of the
specific utility function used in the example. In practice, one might just take $y = q$
and not impose restrictions on the vector $\beta$. In such applications, we must be careful
not to put too much emphasis on $y^*$, which some might view as "desired" or "latent"
charitable contributions (which can, evidently, be negative). In corner solution
applications, we are interested in $y$, which would be actual charitable contributions.


17.2 Useful Expressions for Type I Tobit 


Because y is a nonlinear function of x and u, we will want to derive various features 
of its conditional distribution, D(y |x). Given that assumption (17.3) fully specifies 
D(u|x), we can fully characterize D(y| x). But before doing so, it is useful to derive 
general features of the conditional mean and median of y that do not require full 
distributional assumptions. 

First, suppose E(u |x) = 0. Then, because the function g(z) = max(0, z) is convex, 
it follows from the conditional Jensen’s inequality (see Appendix 2A) that 


$$E(y \mid x) \ge \max[0, E(x\beta + u \mid x)] = \max(0, x\beta). \tag{17.6}$$

Therefore, although we cannot find $E(y \mid x)$ without further assumptions, we do have
a lower bound as a function of $x\beta$.

Next, assume that rather than a zero mean, $\mathrm{Med}(u \mid x) = 0$. Unlike the expected
value, the median operator passes through monotonic functions, and the function
$g(z) = \max(0, z)$ is monotonically increasing (though not strictly so). Therefore,

$$\mathrm{Med}(y \mid x) = \max[0, \mathrm{Med}(x\beta + u \mid x)] = \max(0, x\beta), \tag{17.7}$$

and so, without any restrictions on $D(u \mid x)$ other than a zero median (and certainly
independence between $u$ and $x$ is not required), we have $\mathrm{Med}(y \mid x)$ as a known
function of $x\beta$. From equations (17.6) and (17.7), if $D(u \mid x)$ is symmetric about
zero, it follows that $E(y \mid x) \ge \mathrm{Med}(y \mid x)$. We will return to expression (17.7) in
Section 17.5.4 as a basis for estimating $\beta$ under the zero conditional median
assumption for $u$.
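A small simulation can make (17.6) and (17.7) concrete. The sketch below is only illustrative: it assumes a Student-t error with 5 degrees of freedom (symmetric about zero, so both the mean and the median of $u$ are zero, but not normal) and three made-up values of the index $x\beta$. The sample median of $y$ should be close to $\max(0, x\beta)$, while the sample mean should lie above it.

```python
import numpy as np

rng = np.random.default_rng(1)
# Assumed error distribution: symmetric about zero, heavier tails than the normal
u = rng.standard_t(df=5, size=500_000)

results = {}
for xb in (-1.0, 0.5, 2.0):          # hypothetical values of the index x*beta
    y = np.maximum(0.0, xb + u)      # corner solution outcome
    results[xb] = (np.median(y), y.mean())
    print(f"xb = {xb:5.2f}  median(y) = {results[xb][0]:.3f}  "
          f"mean(y) = {results[xb][1]:.3f}  max(0, xb) = {max(0.0, xb):.3f}")
```

Nothing here relies on normality, which is the point of the two results: the bound on the mean and the form of the median hold under much weaker assumptions than (17.3).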

When u is independent of x and has a normal distribution, we can find an explicit 
expression for E(y |x). We first derive P(y > 0|x) and E(y|x, y > 0), which are of 
interest in their own right. Then, we use the law of iterated expectations to obtain 
E(y|x): 


$$E(y \mid x) = P(y = 0 \mid x) \cdot 0 + P(y > 0 \mid x) \cdot E(y \mid x, y > 0) = P(y > 0 \mid x) \cdot E(y \mid x, y > 0). \tag{17.8}$$


Deriving $P(y > 0 \mid x)$ is easy. Define the binary variable $w = 1$ if $y > 0$, $w = 0$ if
$y = 0$. Then $w$ follows a probit model:

$$P(w = 1 \mid x) = P(y^* > 0 \mid x) = P(u > -x\beta \mid x) = P(u/\sigma > -x\beta/\sigma) = \Phi(x\beta/\sigma). \tag{17.9}$$

One implication of equation (17.9) is that $\gamma \equiv \beta/\sigma$, but not $\beta$ and $\sigma$ separately, can be
consistently estimated from a probit of $w$ on $x$.

To derive $E(y \mid x, y > 0)$, we need the following fact about the normal distribution:
if $z \sim \mathrm{Normal}(0, 1)$, then, for any constant $c$,

$$E(z \mid z > c) = \frac{\phi(c)}{1 - \Phi(c)},$$

where $\phi(\cdot)$ is the standard normal density function. (This is easily shown by noting
that the density of $z$ given $z > c$ is $\phi(z)/[1 - \Phi(c)]$, $z > c$, and then integrating $z\phi(z)$
from $c$ to $\infty$.) Therefore, if $u \sim \mathrm{Normal}(0, \sigma^2)$, then

$$E(u \mid u > c) = \sigma E\left(\frac{u}{\sigma} \,\Big|\, \frac{u}{\sigma} > \frac{c}{\sigma}\right) = \sigma\left[\frac{\phi(c/\sigma)}{1 - \Phi(c/\sigma)}\right].$$

We can use this equation to find $E(y \mid x, y > 0)$ when $y$ follows a Tobit model:

$$E(y \mid x, y > 0) = x\beta + E(u \mid u > -x\beta) = x\beta + \sigma\left[\frac{\phi(x\beta/\sigma)}{\Phi(x\beta/\sigma)}\right], \tag{17.10}$$

since $1 - \Phi(-x\beta/\sigma) = \Phi(x\beta/\sigma)$. Although it is not obvious from looking at equation
(17.10), the right-hand side is positive for any values of $x$ and $\beta$.

For any $c$, the quantity $\lambda(c) \equiv \phi(c)/\Phi(c)$ is called the inverse Mills ratio. Thus,
$E(y \mid x, y > 0)$ is the sum of $x\beta$ and $\sigma$ times the inverse Mills ratio evaluated at $x\beta/\sigma$.

If $x_j$ is a continuous explanatory variable, then

$$\frac{\partial E(y \mid x, y > 0)}{\partial x_j} = \beta_j + \beta_j \frac{d\lambda}{dc}(x\beta/\sigma),$$

assuming that $x_j$ is not functionally related to other regressors. By differentiating
$\lambda(c) = \phi(c)/\Phi(c)$, it can be shown that $\frac{d\lambda}{dc}(c) = -\lambda(c)[c + \lambda(c)]$, and therefore

$$\frac{\partial E(y \mid x, y > 0)}{\partial x_j} = \beta_j\{1 - \lambda(x\beta/\sigma)[x\beta/\sigma + \lambda(x\beta/\sigma)]\}. \tag{17.11}$$


This equation shows that the partial effect of $x_j$ on $E(y \mid x, y > 0)$ is not entirely
determined by $\beta_j$; there is an adjustment factor multiplying $\beta_j$, the term in $\{\cdot\}$, that
depends on $x$ through the index $x\beta/\sigma$. We can use the fact that if $z \sim \mathrm{Normal}(0, 1)$,
then $\mathrm{Var}(z \mid z > -c) = 1 - \lambda(c)[c + \lambda(c)]$ for any $c \in \mathbb{R}$, which implies that the
adjustment factor in equation (17.11), call it $\theta(x\beta/\sigma) = \{1 - \lambda(x\beta/\sigma)[x\beta/\sigma + \lambda(x\beta/\sigma)]\}$, is
strictly between zero and one. Therefore, the sign of $\beta_j$ is the same as the sign of the
partial effect of $x_j$.
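The two facts used above, the derivative of the inverse Mills ratio and the bounds on the adjustment factor, are easy to check numerically. The following is a minimal sketch; the function names `inv_mills` and `theta` and the grid of index values are our own illustrative choices, not notation from the text.

```python
import numpy as np
from scipy.stats import norm

def inv_mills(c):
    """Inverse Mills ratio lambda(c) = phi(c) / Phi(c), computed on the log scale for stability."""
    return np.exp(norm.logpdf(c) - norm.logcdf(c))

def theta(c):
    """Adjustment factor in (17.11): 1 - lambda(c) * [c + lambda(c)]."""
    lam = inv_mills(c)
    return 1.0 - lam * (c + lam)

grid = np.linspace(-6.0, 6.0, 121)   # illustrative values of the index x*beta/sigma
vals = theta(grid)
print("min theta:", vals.min(), " max theta:", vals.max())

# Central finite-difference check of d(lambda)/dc = -lambda(c) * [c + lambda(c)]
h = 1e-5
fd = (inv_mills(grid + h) - inv_mills(grid - h)) / (2 * h)
max_fd_gap = np.max(np.abs(fd + inv_mills(grid) * (grid + inv_mills(grid))))
print("largest gap between numerical and analytical derivative:", max_fd_gap)
```

The factor is close to zero when the index is very negative and close to one when it is large and positive, so the partial effect on $E(y \mid x, y > 0)$ approaches $\beta_j$ only when positive outcomes are nearly certain.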

Other functional forms are easily handled. Suppose that $x_1 = \log(z_1)$ (and that this
is the only place $z_1$ appears in $x$). Then

$$\frac{\partial E(y \mid x, y > 0)}{\partial z_1} = (\beta_1/z_1)\,\theta(x\beta/\sigma), \tag{17.12}$$

where $\beta_1$ now denotes the coefficient on $\log(z_1)$. Or, suppose that $x_1 = z_1$ and
$x_2 = z_1^2$. Then

$$\frac{\partial E(y \mid x, y > 0)}{\partial z_1} = (\beta_1 + 2\beta_2 z_1)\,\theta(x\beta/\sigma),$$

where $\beta_1$ is the coefficient on $z_1$ and $\beta_2$ is the coefficient on $z_1^2$. Interaction terms are
handled similarly. Generally, we compute the partial effect of $x\beta$ with respect to the
variable of interest and multiply this by the factor $\theta(x\beta/\sigma)$.

All of the usual economic quantities such as elasticities can be computed. The
elasticity of $y$ with respect to $x_1$, conditional on $y > 0$, is

$$\frac{\partial E(y \mid x, y > 0)}{\partial x_1} \cdot \frac{x_1}{E(y \mid x, y > 0)}, \tag{17.13}$$

and equations (17.11) and (17.10) can be used to find the elasticity when $x_1$ appears
in levels form. If $z_1$ appears in logarithmic form, the elasticity is obtained simply as
$\partial \log E(y \mid x, y > 0)/\partial \log(z_1)$.

If $x_j$ is a binary variable, the effect of interest is obtained as the difference between
$E(y \mid x, y > 0)$ with $x_j = 1$ and $x_j = 0$. Other discrete variables (such as number of
children) can be handled similarly.

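We can also compute $E(y \mid x)$ from equation (17.8):

$$E(y \mid x) = P(y > 0 \mid x) \cdot E(y \mid x, y > 0) = \Phi(x\beta/\sigma)[x\beta + \sigma\lambda(x\beta/\sigma)] = \Phi(x\beta/\sigma)x\beta + \sigma\phi(x\beta/\sigma). \tag{17.14}$$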
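The three expressions (17.9), (17.10), and (17.14) translate directly into code. The sketch below is one way to write them; the function name `tobit_moments` and the particular values of $x\beta$ and $\sigma$ are hypothetical, chosen only to compare the formulas against a brute-force simulation under assumption (17.3).

```python
import numpy as np
from scipy.stats import norm

def tobit_moments(xb, sigma):
    """Return P(y>0|x), E(y|x,y>0), and E(y|x) for a type I Tobit with index xb."""
    c = np.asarray(xb, dtype=float) / sigma
    prob_pos = norm.cdf(c)                             # equation (17.9)
    mills = np.exp(norm.logpdf(c) - norm.logcdf(c))    # inverse Mills ratio lambda(c)
    cond_mean = xb + sigma * mills                     # equation (17.10)
    uncond_mean = prob_pos * xb + sigma * norm.pdf(c)  # equation (17.14)
    return prob_pos, cond_mean, uncond_mean

# Monte Carlo comparison at one made-up point
rng = np.random.default_rng(0)
xb, sigma = -0.5, 2.0
y = np.maximum(0.0, xb + sigma * rng.standard_normal(1_000_000))

p, cm, um = tobit_moments(xb, sigma)
print("P(y>0|x):    formula", p,  " simulation", (y > 0).mean())
print("E(y|x,y>0):  formula", cm, " simulation", y[y > 0].mean())
print("E(y|x):      formula", um, " simulation", y.mean())
```

Even with a negative index, $E(y \mid x)$ is strictly positive, which is consistent with the lower bound $\max(0, x\beta)$ in (17.6).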


We can find the partial derivatives of $E(y \mid x)$ with respect to continuous $x_j$ using the
chain rule. In examples where $y$ is some quantity chosen by individuals (labor supply,
charitable contributions, life insurance), this derivative accounts for the fact that
some people who start at $y = 0$ may switch to $y > 0$ when $x_j$ changes. Formally,

$$\frac{\partial E(y \mid x)}{\partial x_j} = \frac{\partial P(y > 0 \mid x)}{\partial x_j} \cdot E(y \mid x, y > 0) + P(y > 0 \mid x) \cdot \frac{\partial E(y \mid x, y > 0)}{\partial x_j}. \tag{17.15}$$

This decomposition is attributed to McDonald and Moffitt (1980). Because $P(y > 0 \mid x) = \Phi(x\beta/\sigma)$,
$\partial P(y > 0 \mid x)/\partial x_j = (\beta_j/\sigma)\phi(x\beta/\sigma)$. If we plug this along with equation
(17.11) into equation (17.15), we get a remarkable simplification:

$$\frac{\partial E(y \mid x)}{\partial x_j} = \Phi(x\beta/\sigma)\beta_j. \tag{17.16}$$
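For readers who want to see the cancellation, here is a sketch of the algebra. Write $c = x\beta/\sigma$ and use $\Phi(c)\lambda(c) = \phi(c)$, which follows from the definition of the inverse Mills ratio. Substituting (17.10), (17.11), and the derivative of $\Phi(x\beta/\sigma)$ into (17.15) gives

$$
\begin{aligned}
\frac{\partial E(y \mid x)}{\partial x_j}
&= \frac{\beta_j}{\sigma}\phi(c)\,[x\beta + \sigma\lambda(c)] + \Phi(c)\,\beta_j\{1 - \lambda(c)[c + \lambda(c)]\} \\
&= \beta_j\phi(c)[c + \lambda(c)] + \Phi(c)\beta_j - \beta_j\phi(c)[c + \lambda(c)] \\
&= \Phi(c)\beta_j .
\end{aligned}
$$

The first term on the second line uses $x\beta/\sigma = c$, and the third term uses $\Phi(c)\lambda(c) = \phi(c)$; the two terms involving $\phi(c)[c + \lambda(c)]$ then cancel.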

The estimated scale factor for a given $x$ is $\Phi(x\hat\beta/\hat\sigma)$. This scale factor has a very
interesting interpretation: $\Phi(x\hat\beta/\hat\sigma) = \hat{P}(y > 0 \mid x)$; that is, $\Phi(x\hat\beta/\hat\sigma)$ is the estimated
probability of observing a positive response given $x$. If $\Phi(x\hat\beta/\hat\sigma)$ is close to one, then
it is unlikely we observe $y_i = 0$ when $x_i = x$, and the adjustment factor becomes
unimportant. We can evaluate $\Phi(x\hat\beta/\hat\sigma)$ at interesting values of $x$ to determine how
the estimated partial effects change as the covariates change. One possibility is to
plug in mean values, although this need not correspond to any particular population
unit when $x$ contains discrete elements (and even sometimes when $x$ contains all
continuous elements). We can use median values, or plug in various quantiles. Often
it is useful to have a single scale factor, and the scale factor that delivers APEs is
probably the most useful. The APE for a continuous variable $x_j$ is estimated as

$$\left[N^{-1}\sum_{i=1}^{N}\Phi(x_i\hat\beta/\hat\sigma)\right]\hat\beta_j. \tag{17.17}$$

The scale factor in equation (17.17) is the average of the estimated $P(y > 0 \mid x_i)$ across the sample.
The delta method can be used to obtain a valid asymptotic standard error for (17.17),
but the bootstrap is convenient and feasible because estimation of Tobit models can
be done fairly quickly.
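Given estimates $\hat\beta$ and $\hat\sigma$, the scale factor and the APEs in (17.17) take one line each. The sketch below assumes the estimates are already in hand; the function name `tobit_ape_continuous` and the tiny design matrix are made up for illustration and do not come from any data set used in the text.

```python
import numpy as np
from scipy.stats import norm

def tobit_ape_continuous(X, beta_hat, sigma_hat):
    """Average scale factor and APEs for continuous regressors, equation (17.17)."""
    scale = norm.cdf(X @ beta_hat / sigma_hat).mean()   # sample average of estimated P(y>0|x_i)
    return scale, scale * beta_hat

# Hypothetical inputs: an intercept and one regressor, three observations
X = np.array([[1.0, 0.0],
              [1.0, 1.0],
              [1.0, 2.0]])
beta_hat = np.array([0.0, 1.0])
sigma_hat = 1.0

scale, ape = tobit_ape_continuous(X, beta_hat, sigma_hat)
print("scale factor:", scale)
print("APEs:", ape)
```

A bootstrap standard error would wrap this calculation, together with re-estimation of the Tobit model, inside a loop over resampled rows of the data; that step is omitted here.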

Naturally, the scale factor in (17.17) is always between zero and one, and that fact
helps explain why Tobit coefficients are typically larger than OLS coefficients from a
linear regression. If $\hat\gamma_j$ is the OLS estimate on a continuous variable $x_j$ from the
regression $y_i$ on $x_i$, we can compare $\hat\gamma_j$ to (17.17) as an indication of whether the linear
model gives similar estimates of the APEs. Sometimes it will, but in other cases the
linear model APEs can be notably different from the Tobit APEs.

For discrete explanatory variables or for large changes in continuous ones, we can
compute the difference in $E(y \mid x)$ at different values of $x$. For example, suppose $x_K$ is
a binary variable (such as a policy indicator), and define, for each observation $i$, the
two indices $\hat w_{i1} = x_{i(K)}\hat\beta_{(K)} + \hat\beta_K$ and $\hat w_{i0} = x_{i(K)}\hat\beta_{(K)}$, where $x_{i(K)}$ is the $1 \times (K - 1)$
row vector with $x_{iK}$ dropped. Then $\hat w_{i1}$ is the estimated index for person $i$ when
$x_K = 1$ and $\hat w_{i0}$ is the estimated index for person $i$ when $x_K = 0$. (One of these is
a counterfactual because $x_{iK}$ is either zero or one for each $i$.) Then the average
difference

$$N^{-1}\sum_{i=1}^{N}\left\{[\Phi(\hat w_{i1}/\hat\sigma)\hat w_{i1} + \hat\sigma\phi(\hat w_{i1}/\hat\sigma)] - [\Phi(\hat w_{i0}/\hat\sigma)\hat w_{i0} + \hat\sigma\phi(\hat w_{i0}/\hat\sigma)]\right\} \tag{17.18}$$

is the estimated APE of the binary variable $x_K$. Again, bootstrapping is a convenient
method of computing a standard error.
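Equation (17.18) amounts to evaluating the Tobit mean function twice per observation, once with the binary variable switched on and once with it switched off, and averaging the difference. A minimal sketch follows, with hypothetical names (`tobit_mean`, `tobit_ape_binary`) and made-up inputs.

```python
import numpy as np
from scipy.stats import norm

def tobit_mean(w, sigma):
    """Tobit conditional mean (17.14) evaluated at index w."""
    return norm.cdf(w / sigma) * w + sigma * norm.pdf(w / sigma)

def tobit_ape_binary(X, beta_hat, sigma_hat, k):
    """Average partial effect (17.18) of the binary regressor in column k of X."""
    X1, X0 = X.copy(), X.copy()
    X1[:, k] = 1.0     # everyone treated as x_K = 1
    X0[:, k] = 0.0     # everyone treated as x_K = 0 (one of the two is counterfactual)
    diff = tobit_mean(X1 @ beta_hat, sigma_hat) - tobit_mean(X0 @ beta_hat, sigma_hat)
    return diff.mean()

# Hypothetical inputs: intercept, one continuous regressor, one binary regressor
X = np.array([[1.0,  0.2, 0.0],
              [1.0, -1.0, 1.0],
              [1.0,  1.5, 0.0],
              [1.0,  0.3, 1.0]])
beta_hat = np.array([0.4, 1.0, 0.8])
sigma_hat = 1.2

ape_binary = tobit_ape_binary(X, beta_hat, sigma_hat, k=2)
print("APE of the binary variable:", ape_binary)
```

Because the derivative of the mean function with respect to the index is $\Phi(\cdot)$, which lies strictly between zero and one, the resulting APE has the same sign as the coefficient on the binary variable and is smaller in magnitude.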

The equations for the partial effects in the type I Tobit model, equations (17.11)
and (17.16), and the estimates in (17.17) and (17.18) reveal an important point about
the parameters: $\sigma$ as well as $\beta$ appears in the partial effects. In other words, if we
can only estimate $\beta$, we cannot estimate the partial effects of the covariates on
$E(y \mid x, y > 0)$ and $E(y \mid x)$. Therefore, treatments of the type I Tobit for corner
solutions that refer to $\sigma$ as “ancillary” (of secondary importance) are misleading.
In a linear model, of course, the variance of the errors plays no role in obtaining
partial effects on the mean, and so the estimated error variance plays a role only in
obtaining the usual OLS standard errors. But the variance of $u$ in equation (17.2)
directly enters the conditional means, and so we should not think of $\sigma$ as ancillary
when our interest is in partial effects on the mean. (As we will see in Chapter 20, in
true data censoring contexts we are interested in $\beta$, and $\sigma$ becomes ancillary to
estimating partial effects.)

Equations (17.9), (17.11), and (17.14) show that, for continuous variables $x_j$ and
$x_h$, the relative partial effects on $P(y > 0 \mid x)$, $E(y \mid x, y > 0)$, and $E(y \mid x)$ are all
equal to $\beta_j/\beta_h$ (assuming that $\beta_h \neq 0$). This feature can be a limitation of the Tobit
model, and we will study models that relax this implication in Section 17.6.

By taking the log of equation (17.8) and differentiating, we see that the elasticity
(or semielasticity) of $E(y \mid x)$ with respect to any $x_j$ is simply the sum of the elasticities
(or semielasticities) of $\Phi(x\beta/\sigma)$ and $E(y \mid x, y > 0)$, each with respect to $x_j$.


17.3 Estimation and Inference with the Type I Tobit Model 


Let $\{(x_i, y_i): i = 1, 2, \ldots, N\}$ be a random sample following the censored Tobit model.
To use maximum likelihood, we need to derive the density of $y_i$ given $x_i$. We have
already shown that $f(0 \mid x_i) = P(y_i = 0 \mid x_i) = 1 - \Phi(x_i\beta/\sigma)$. Further, for $y > 0$,
$P(y_i \le y \mid x_i) = P(y_i^* \le y \mid x_i)$, which implies that

$$f(y \mid x_i) = f^*(y \mid x_i), \quad \text{all } y > 0,$$

where $f^*(\cdot \mid x_i)$ denotes the density of $y_i^*$ given $x_i$. (We use $y$ as the dummy argument
in the density.) By assumption, $y_i^* \mid x_i \sim \mathrm{Normal}(x_i\beta, \sigma^2)$, so

$$f^*(y \mid x_i) = \frac{1}{\sigma}\phi[(y - x_i\beta)/\sigma], \quad -\infty < y < \infty.$$


(As in recent chapters, we will use $\beta$ and $\sigma^2$ to denote the true values as well as
dummy arguments in the log-likelihood function and its derivatives.) We can write
the density for $y_i$ given $x_i$ compactly using the indicator function $1[\cdot]$ as

$$f(y \mid x_i) = \{1 - \Phi(x_i\beta/\sigma)\}^{1[y = 0]}\{(1/\sigma)\phi[(y - x_i\beta)/\sigma]\}^{1[y > 0]}, \tag{17.19}$$

where the density is zero for $y < 0$. Let $\theta \equiv (\beta', \sigma^2)'$ denote the $(K + 1) \times 1$ vector of
parameters. The log likelihood is

$$\ell_i(\theta) = 1[y_i = 0]\log[1 - \Phi(x_i\beta/\sigma)] + 1[y_i > 0]\{\log\phi[(y_i - x_i\beta)/\sigma] - \log(\sigma^2)/2\}. \tag{17.20}$$


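Apart from a constant that does not affect the maximization, equation (17.20) can be
written as

$$1[y_i = 0]\log[1 - \Phi(x_i\beta/\sigma)] - 1[y_i > 0]\{(y_i - x_i\beta)^2/(2\sigma^2) + \log(\sigma^2)/2\}.$$

Therefore,

$$\partial\ell_i(\theta)/\partial\beta = -1[y_i = 0]\,\phi(x_i\beta/\sigma)\,x_i/\{\sigma[1 - \Phi(x_i\beta/\sigma)]\} + 1[y_i > 0](y_i - x_i\beta)x_i/\sigma^2, \tag{17.21}$$

$$\partial\ell_i(\theta)/\partial\sigma^2 = 1[y_i = 0]\,\phi(x_i\beta/\sigma)(x_i\beta)/\{2\sigma^3[1 - \Phi(x_i\beta/\sigma)]\} + 1[y_i > 0]\{(y_i - x_i\beta)^2/(2\sigma^4) - 1/(2\sigma^2)\}. \tag{17.22}$$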




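The second derivatives are complicated, but all we need is $A(x_i, \theta) \equiv -E[H_i(\theta) \mid x_i]$.
After tedious calculations it can be shown that

$$A(x_i, \theta) = \begin{pmatrix} a_i x_i'x_i & b_i x_i' \\ b_i x_i & c_i \end{pmatrix}, \tag{17.23}$$

where

$$a_i = -\sigma^{-2}\{x_i\gamma\phi_i - [\phi_i^2/(1 - \Phi_i)] - \Phi_i\},$$

$$b_i = \sigma^{-3}\{(x_i\gamma)^2\phi_i + \phi_i - [(x_i\gamma)\phi_i^2/(1 - \Phi_i)]\}/2,$$

$$c_i = -\sigma^{-4}\{(x_i\gamma)^3\phi_i + (x_i\gamma)\phi_i - [(x_i\gamma)^2\phi_i^2/(1 - \Phi_i)] - 2\Phi_i\}/4,$$

$\gamma = \beta/\sigma$, and $\phi_i$ and $\Phi_i$ are evaluated at $x_i\gamma$. This matrix is used in equation (13.32) to
obtain the estimate of $\mathrm{Avar}(\hat\theta)$. See Amemiya (1973) for details.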
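To make the estimation step concrete, here is a minimal sketch of Tobit maximum likelihood based on (17.20). It is not the routine used by any particular econometrics package. Several choices are our own assumptions: the scale parameter is handled through $\log\sigma$ so the optimizer is unconstrained, the gradient is left to numerical differentiation rather than coded from (17.21) and (17.22), OLS provides starting values, and the data are simulated with made-up parameter values.

```python
import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm

def tobit_negloglik(params, y, X):
    """Average negative log likelihood from (17.20); params = (beta, log sigma)."""
    beta, sigma = params[:-1], np.exp(params[-1])
    xb = X @ beta
    pos = y > 0
    ll_zero = norm.logcdf(-xb[~pos] / sigma)     # log[1 - Phi(x_i beta / sigma)]
    ll_pos = norm.logpdf((y[pos] - xb[pos]) / sigma) - np.log(sigma)
    return -(ll_zero.sum() + ll_pos.sum()) / len(y)

def fit_tobit(y, X):
    """Maximize the Tobit log likelihood, starting from OLS estimates."""
    b0 = np.linalg.lstsq(X, y, rcond=None)[0]
    start = np.append(b0, np.log(np.std(y - X @ b0)))
    res = minimize(tobit_negloglik, start, args=(y, X), method="BFGS")
    return res.x[:-1], np.exp(res.x[-1])

# Simulated data under (17.2) and (17.3); all parameter values are illustrative
rng = np.random.default_rng(42)
N = 5000
X = np.column_stack([np.ones(N), rng.standard_normal(N)])
beta_true, sigma_true = np.array([0.5, 1.0]), 1.5
y = np.maximum(0.0, X @ beta_true + sigma_true * rng.standard_normal(N))

beta_hat, sigma_hat = fit_tobit(y, X)
print("beta_hat:", beta_hat, " sigma_hat:", sigma_hat)
```

Standard errors are not computed in this sketch. In practice they would come from an estimate of the expected Hessian such as (17.23), or from the bootstrap.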

Testing is easily carried out in a standard maximum likelihood estimator (MLE)
framework. Single exclusion restrictions are tested using asymptotic t statistics once
$\hat\beta_j$ and its asymptotic standard error have been obtained. Multiple exclusion
restrictions are easily tested using the likelihood ratio (LR) statistic, and some econometrics
packages routinely compute the Wald statistic. If the unrestricted model has so many
variables that computation becomes an issue, the Lagrange multiplier (LM) statistic is
an attractive alternative.

The Wald statistic is the easiest to compute for testing nonlinear restrictions on $\beta$,
just as in binary response analysis, because the unrestricted model is just standard
Tobit.


17.4 Reporting the Results 


As with any estimation of a parametric model, the parameter estimates $\hat\beta$ and $\hat\sigma$ and
their standard errors should be reported. We saw in Section 17.2 that the $\hat\beta_j$ give the
direction of the partial effects on the means, and the ratios $\hat\beta_j/\hat\beta_h$ are the relative partial
effects for continuous variables. The value of the log likelihood should be included
for testing purposes and also to allow comparisons with other nonnested models. A
goodness-of-fit statistic for the conditional mean $E(y \mid x)$ is often of interest, particularly
for comparing with other models of $E(y \mid x)$ (including linear models). A simple
measure is the squared correlation between the actual outcomes, $y_i$, and fitted values,
$\hat y_i$, obtained by evaluating equation (17.14) at $x_i$ and the MLEs, $\hat\beta$ and $\hat\sigma$.

It is also important to report partial effects. For continuous variables we can use
(17.17) and for binary variables we can use (17.18). For discrete variables that are not
binary it is less obvious how to report a single APE. For key explanatory variables,
one might want to evaluate partial effects at a range of values, and then average out
across the other explanatory variables. Or, we can evaluate partial effects with each
covariate evaluated at its mean or its median, or other quantiles of interest.

Table 17.1
OLS and Tobit Estimation of Annual Hours Worked

Dependent Variable: hours

Independent Variable     Linear (OLS)     Tobit (MLE)

nwifeinc                    -3.45            -8.81
                            (2.54)           (4.46)
educ                        28.76            80.65
                           (12.95)          (21.58)
exper                       65.67           131.56
                            (9.96)          (17.28)
exper²                      -.700            -1.86
                            (.325)           (0.54)
age                        -30.51           -54.41
                            (4.36)           (7.42)
kidslt6                   -442.09          -894.02
                           (58.85)         (111.88)
kidsge6                    -32.78           -16.22
                           (23.18)          (38.64)
constant                 1,330.48           965.31
                          (270.78)         (446.44)

Log-likelihood value           —         -3,819.09
R-squared                    .266             .275
$\hat\sigma$                750.18         1,122.02


Example 17.2 (Annual Hours Equation for Married Women): We use the Mroz 
(1987) data (MROZ.RAW) to estimate a reduced form annual hours equation for 
married women. The equation is a reduced form because we do not include hourly 
wage offer as an explanatory variable. The hourly wage offer is unlikely to be exog- 
enous, and, just as important, we cannot observe it when hours = 0. We will show 
how to deal with both these issues in Chapter 19. For now, the explanatory variables 
are the same ones appearing in the labor force participation probit in Example 15.2. 

Of the 753 women in the sample, 428 worked for a wage outside the home during 
the year; 325 of the women worked zero hours. For the women who worked positive 
hours, the range is fairly broad, ranging from 12 to 4,950. Thus, annual hours 
worked is a reasonable candidate for a Tobit model. We also estimate a linear model 
(using all 753 observations) by OLS. The results are given in Table 17.1.

Not surprisingly, the Tobit coefficient estimates are the same sign as the
corresponding OLS estimates, and the statistical significance of the estimates is similar.
(Possible exceptions are the coefficients on nwifeinc and kidsge6, but the t statistics
have similar magnitudes.) Second, though it is tempting to compare the magnitudes
of the OLS estimates and the Tobit estimates, such comparisons are not very
informative. We must not think that, because the Tobit coefficient on kidslt6 is roughly
twice that of the OLS coefficient, the Tobit model somehow implies a much greater
response of hours worked to young children.

The scale factor computed in equation (17.17) is about .589, and we can multiply 
this by the Tobit coefficients—at least on the roughly continuous variables—to 
obtain estimated APEs. For example, the APE for educ is about .589(80.65) = 47.5, 
and a bootstrap standard error based on 500 replications is about 13. The Tobit 
estimated APE is well above the comparable OLS estimate, 28.8, and just as precisely 
estimated. 

For the discrete variable kidslt6, if we use the calculus definition of an APE, the
Tobit APE is about -526.68, which again is much larger in magnitude than the OLS
coefficient (-442.1). However, if we compute the difference in estimated expected
values at kidslt6 = 1 and kidslt6 = 0, and average these differences, the result is about
-487.2 (bootstrap standard error = 54), and this is more in line with the OLS
estimate (but still larger in magnitude). The APE in moving from one small child to two
small children is about -246.2 (bootstrap standard error = 12.3): not surprisingly,
the effect on expected hours of having a second young child is less than having a first
young child. It makes sense that the OLS estimate is between these two values yet
closer to the first partial effect: 118 women in the sample have one small child and
only 29 have two or more. If we compute a weighted average of the two APEs, the
result is about -439, which is quite close to the OLS estimate and, in some sense,
verifies that in some cases OLS can provide a good estimate of APEs, even for a
discrete explanatory variable. Interestingly, the APE computed from the Tobit model
based on the approximation that kidslt6 is continuous is actually worse than the OLS
estimate.

We can also evaluate the partial effects at the average values of the covariates,
where we plug the average of exper into the quadratic rather than using the average of exper².
The scale factor evaluated at the means is about .645, which implies partial effects even
larger than the APEs. In other words, the partial effects at the average (PEAs) are
noticeably larger than the APEs.
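The arithmetic behind these comparisons is simple enough to script. The sketch below takes the Tobit coefficients from Table 17.1 and the two scale factors quoted in the text (.589 and .645) as given; it does not re-estimate anything, and the dictionary names are our own.

```python
# Tobit coefficients from Table 17.1 for three roughly continuous regressors
tobit_coef = {"nwifeinc": -8.81, "educ": 80.65, "age": -54.41}

ape_scale = 0.589       # average of Phi(x_i b / s) across the sample, equation (17.17)
at_mean_scale = 0.645   # Phi(. ) evaluated at the covariate means

ape = {name: ape_scale * b for name, b in tobit_coef.items()}
at_mean = {name: at_mean_scale * b for name, b in tobit_coef.items()}

for name in tobit_coef:
    print(f"{name:9s} APE = {ape[name]:8.2f}   effect at the means = {at_mean[name]:8.2f}")
```

For educ this reproduces the value of about 47.5 reported above and gives roughly 52.0 when the scale factor is evaluated at the means.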

We can also compute the partial effects on E(y |x, y > 0). Again, plugging in the 
mean values of the explanatory variables, the scale factor in equation (17.11) is about 
.451, and this number can be multiplied by coefficients to obtain the estimated 
change in expected hours conditional on hours being positive.

We have reported an R-squared for both the linear regression model and the Tobit
model. The R-squared for OLS is the usual one. For Tobit, the R-squared is the
square of the correlation coefficient between $y_i$ and $\hat y_i$, where $\hat y_i = \Phi(x_i\hat\beta/\hat\sigma)x_i\hat\beta + \hat\sigma\phi(x_i\hat\beta/\hat\sigma)$
is the estimate of $E(y \mid x = x_i)$. This statistic is motivated by the fact that
the usual R-squared for OLS is equal to the squared correlation between the $y_i$ and
the OLS fitted values.

Based on the R-squared measures, the Tobit conditional mean function fits the 
hours data somewhat better, although the difference is not overwhelming. However, we 
should remember that the Tobit estimates are not chosen to maximize an R-squared— 
they maximize the log-likelihood function—whereas the OLS estimates produce the 
highest R-squared given the linear functional form for the conditional mean. 

When two additional variables, the local unemployment rate and a binary city
indicator, are included, the log likelihood becomes about -3,817.89. The likelihood
ratio statistic is about 2(3,819.09 - 3,817.89) = 2.40. This is the outcome of a $\chi^2_2$
variate under $H_0$, and so the p-value is about .30. Therefore, these two variables are
jointly insignificant.
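The likelihood ratio calculation can be reproduced from the two reported log-likelihood values. This short sketch uses SciPy's chi-square survival function with two degrees of freedom, one for each added variable.

```python
from scipy.stats import chi2

loglik_restricted = -3819.09     # Tobit in Table 17.1
loglik_unrestricted = -3817.89   # adding the unemployment rate and the city indicator

lr = 2 * (loglik_unrestricted - loglik_restricted)
p_value = chi2.sf(lr, df=2)      # two exclusion restrictions
print(f"LR = {lr:.2f}, p-value = {p_value:.3f}")
```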


17.5 Specification Issues in Tobit Models 


17.5.1 Neglected Heterogeneity 
Suppose that we are initially interested in the model

$$y = \max(0, x\beta + \gamma q + u), \qquad u \mid x, q \sim \mathrm{Normal}(0, \sigma^2), \tag{17.24}$$

where $q$ is an unobserved variable that is assumed to be independent of $x$ and has a
$\mathrm{Normal}(0, \tau^2)$ distribution. It follows immediately that

$$y = \max(0, x\beta + v), \qquad v \mid x \sim \mathrm{Normal}(0, \sigma^2 + \gamma^2\tau^2). \tag{17.25}$$

Thus, $y$ conditional on $x$ follows a Tobit model, and Tobit of $y$ on $x$ consistently
estimates $\beta$ and the composite variance $\sigma^2 + \gamma^2\tau^2$.

What about estimating partial effects on $E(y \mid x, q)$? As we discussed in Sections
2.2.5 and 15.7.1, we are often interested in the APEs, where, say, $E(y \mid x, q)$ is averaged
over the population distribution of $q$, and then derivatives or differences with
respect to elements of $x$ are obtained. From Section 2.2.5 we know that when the
heterogeneity is independent of $x$, the APEs are obtained by finding $E(y \mid x)$. Naturally,
this conditional mean comes from the distribution of $y$ given $x$. Under the
preceding assumptions, it is exactly this distribution that Tobit of $y$ on $x$ estimates. In
other words, we estimate the desired quantities, the APEs, by simply ignoring the
heterogeneity. This is the same conclusion we reached for the probit model in Section
15.7.1.

If q is not normal, then these arguments do not carry over because y given x does 
not follow a Tobit model. But the flavor of the argument does. A more difficult issue 
arises when q and x are correlated, and we address that in the next subsection. 

We can also ask what happens if, rather than having heterogeneity appear additively
inside the index, as in equation (17.24), the heterogeneity appears multiplicatively:
$y = q \cdot \max(0, x\beta + u)$, where $q \ge 0$ and we assume $q$ is independent of $(x, u)$.
The distribution $D(y \mid x)$ now depends on the distribution of $q$, and does not follow
a type I Tobit model; generally, finding its distribution would be difficult, even if
we specify a simple distribution for $q$. Nevertheless, if we normalize $E(q) = 1$, then
$E(y \mid x, u) = E(q \mid x, u) \cdot \max(0, x\beta + u) = \max(0, x\beta + u)$ (because $E(q \mid x, u) = 1$). It
follows immediately from iterated expectations that if assumption (17.3) holds, then
$E(y \mid x)$ has exactly the same form as the type I Tobit model in equation (17.14). That
equation (17.14) holds for an extension of the Tobit model means that it makes sense
to estimate $E(y \mid x) = \Phi(x\beta/\sigma)x\beta + \sigma\phi(x\beta/\sigma)$ by NLS, or a weighted NLS procedure
(or a quasi-MLE, which we discuss in Chapter 18). NLS and WNLS approaches
are consistent under (17.14) even though $D(y \mid x)$ does not follow the type I Tobit
distribution.


17.5.2 Endogenous Explanatory Variables 


Suppose we now allow one of the variables in the Tobit model to be endogenous. The
first model we consider has a continuous endogenous explanatory variable:

$$y_1 = \max(0, z_1\delta_1 + \alpha_1 y_2 + u_1), \tag{17.26}$$
$$y_2 = z\delta_2 + v_2 = z_1\delta_{21} + z_2\delta_{22} + v_2, \tag{17.27}$$

where $(u_1, v_2)$ are zero-mean normally distributed, independent of $z$. If $u_1$ and $v_2$ are
correlated, then $y_2$ is endogenous. For identification we need the usual rank condition
$\delta_{22} \neq 0$; $E(z'z)$ is assumed to have full rank, as always.

Naturally, we are interested in estimating $\delta_1$ and $\alpha_1$, but we are also interested in
estimating APEs, which depend on $\sigma_1^2 = \mathrm{Var}(u_1)$. The reasoning is just as for the
probit model in Section 15.7.2. Holding other factors fixed, the difference in $y_1$ when
$y_2$ changes from $\bar y_2$ to $\bar y_2 + 1$ is

$$\max[0, \bar z_1\delta_1 + \alpha_1(\bar y_2 + 1) + u_1] - \max[0, \bar z_1\delta_1 + \alpha_1\bar y_2 + u_1].$$

Averaging this expression across the distribution of $u_1$ gives differences in expectations
that have the form (17.14), with $x = [\bar z_1, (\bar y_2 + 1)]$ in the first case, $x = (\bar z_1, \bar y_2)$
in the second, and $\sigma = \sigma_1$.

Before estimating this model by MLE, a procedure that requires obtaining the
distribution of $(y_1, y_2)$ given $z$, it is convenient to have a two-step procedure that
also delivers a simple test for the endogeneity of $y_2$. Smith and Blundell (1986) propose
a two-step procedure that is analogous to the Rivers-Vuong method (see Section
15.7.2) for binary response models. Under bivariate normality of $(u_1, v_2)$, we can
write

$$u_1 = \theta_1 v_2 + e_1, \tag{17.28}$$

where $\theta_1 = \eta_1/\tau_2^2$, $\eta_1 = \mathrm{Cov}(u_1, v_2)$, $\tau_2^2 = \mathrm{Var}(v_2)$, and $e_1$ is independent of $v_2$ with a
zero-mean normal distribution and variance, say, $\tau_1^2$. Further, because $(u_1, v_2)$ is
independent of $z$, $e_1$ is independent of $(z, v_2)$. Now, plugging equation (17.28) into
equation (17.26) gives

$$y_1 = \max(0, z_1\delta_1 + \alpha_1 y_2 + \theta_1 v_2 + e_1), \tag{17.29}$$

where $e_1 \mid z, v_2 \sim \mathrm{Normal}(0, \tau_1^2)$. Using our previous notation, we can write

$$D(y_1 \mid z, y_2) = \mathrm{Tobit}(z_1\delta_1 + \alpha_1 y_2 + \theta_1(y_2 - z\delta_2), \tau_1^2) = \mathrm{Tobit}(z_1\delta_1 + \alpha_1 y_2 + \theta_1 v_2, \tau_1^2).$$

It follows that, if we knew $v_2$, we would just estimate $\delta_1$, $\alpha_1$, $\theta_1$, and $\tau_1^2$ by type I
Tobit. We do not observe $v_2$ because it depends on the unknown vector $\delta_2$. However,
we can easily estimate $\delta_2$ by OLS in a first stage. The Smith-Blundell procedure is as
follows:


Procedure 17.1: (a) Estimate the reduced form of $y_2$ by OLS; this step gives $\hat\delta_2$.
Define the reduced-form OLS residuals as $\hat v_2 = y_2 - z\hat\delta_2$.

(b) Estimate a standard Tobit of $y_1$ on $z_1$, $y_2$, and $\hat v_2$. This step gives consistent
estimators of $\delta_1$, $\alpha_1$, $\theta_1$, and $\tau_1^2$.


The usual t statistic on $\hat v_2$ reported by Tobit provides a simple test of the null
$H_0: \theta_1 = 0$, which says that $y_2$ is exogenous. Further, under $\theta_1 = 0$, $e_1 = u_1$, and so
normality of $v_2$ plays no role: as a test for endogeneity of $y_2$, the Smith-Blundell
approach is valid without any distributional assumptions on the reduced form of $y_2$.
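The two steps of Procedure 17.1 can be sketched on simulated data. Everything specific in the block below is an assumption made for illustration: the data-generating values, the helper `fit_tobit` (a bare-bones Tobit MLE using $\log\sigma$ and numerical derivatives), and the sample size. The point is only to show the mechanics and to contrast the control function estimate of $\alpha_1$ with a Tobit that ignores the endogeneity of $y_2$. No standard errors are computed.

```python
import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm

def fit_tobit(y, X):
    """Bare-bones type I Tobit MLE; returns (coefficients, sigma)."""
    def nll(p):
        b, s = p[:-1], np.exp(p[-1])
        xb, pos = X @ b, y > 0
        ll = norm.logcdf(-xb[~pos] / s).sum()
        ll += (norm.logpdf((y[pos] - xb[pos]) / s) - np.log(s)).sum()
        return -ll / len(y)
    b0 = np.linalg.lstsq(X, y, rcond=None)[0]
    start = np.append(b0, np.log(np.std(y - X @ b0)))
    res = minimize(nll, start, method="BFGS")
    return res.x[:-1], np.exp(res.x[-1])

# Simulated data; all numbers are illustrative assumptions
rng = np.random.default_rng(7)
N = 20_000
z1 = rng.standard_normal(N)                      # included exogenous variable
z2 = rng.standard_normal(N)                      # excluded instrument
v2 = rng.standard_normal(N)                      # reduced-form error
u1 = 0.6 * v2 + 0.8 * rng.standard_normal(N)     # theta1 = 0.6, tau1 = 0.8
y2 = 0.5 * z1 + 1.0 * z2 + v2                    # reduced form (17.27)
y1 = np.maximum(0.0, 0.5 + 1.0 * z1 + 1.0 * y2 + u1)   # (17.26) with alpha1 = 1

ones = np.ones(N)

# Step (a): OLS reduced form for y2 and its residuals
Z = np.column_stack([ones, z1, z2])
delta2_hat = np.linalg.lstsq(Z, y2, rcond=None)[0]
v2_hat = y2 - Z @ delta2_hat

# Step (b): Tobit of y1 on (1, z1, y2, v2_hat)
coef_cf, tau1_hat = fit_tobit(y1, np.column_stack([ones, z1, y2, v2_hat]))

# For comparison: Tobit that treats y2 as exogenous
coef_naive, _ = fit_tobit(y1, np.column_stack([ones, z1, y2]))

print("control function: alpha1 =", coef_cf[2], " theta1 =", coef_cf[3], " tau1 =", tau1_hat)
print("naive Tobit:      alpha1 =", coef_naive[2])
```

In this design the coefficient on `v2_hat` estimates $\theta_1 = 0.6$, and the naive Tobit coefficient on $y_2$ is pushed away from the true value of one because $y_2$ is positively correlated with $u_1$.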


Example 17.3 (Testing Exogeneity of Other Income in the Hours Equation): As an
illustration, we test for endogeneity of nwifeinc in the reduced-form hours equation in
Example 17.2. We assume that huseduc is exogenous in the hours equation, and so
huseduc is a valid instrument for nwifeinc. We first obtain $\hat v_2$ as the OLS residuals
from estimating the reduced form for nwifeinc. When $\hat v_2$ is added to the Tobit model
in Example 17.2 (without unem and city), its coefficient is 24.42 with t statistic =
1.47. Thus, there is marginal evidence that nwifeinc is endogenous in the equation.
The test is valid under the null hypothesis that nwifeinc is exogenous even if nwifeinc
does not have a conditional normal distribution.


When $\theta_1 \neq 0$, the second-stage Tobit standard errors and test statistics are not
asymptotically valid because $\hat\delta_2$ has been used in place of $\delta_2$. Smith and Blundell
(1986) contain formulas for correcting the asymptotic variances; these can be derived
using the formulas for two-step M-estimators in Chapter 12. Alternatively, it is easy 
to program the two-step procedure in a bootstrap resampling scheme, and the com- 
putational time should be reasonable even for larger data sets. 

It is easily seen that joint normality of $(u_1, v_2)$ is not necessary for the two-step
estimator to consistently estimate the parameters. It suffices that $u_1$ conditional on
$(z, v_2)$ is distributed as $\mathrm{Normal}(\theta_1 v_2, \tau_1^2)$. Still, this is a fairly restrictive assumption
that cannot be expected to hold when $y_2$ is discrete or partially discrete.

We can recover an estimate of $\sigma_1^2$ from the estimates obtained from the two-step
control function procedure. From equation (17.28) we have $\sigma_1^2 = \theta_1^2\tau_2^2 + \tau_1^2$, and $\hat\tau_2^2$ is
obtained as the usual estimated error variance from the first-stage regression, whereas
$\hat\theta_1$ and $\hat\tau_1^2$ are obtained from the second-stage Tobit. Forming $\hat\sigma_1^2 = \hat\theta_1^2\hat\tau_2^2 + \hat\tau_1^2$ gives us
all the estimates we need to obtain the APEs. Generally, it is useful to define the
function

$$m(z, \sigma^2) \equiv \Phi(z/\sigma)z + \sigma\phi(z/\sigma). \tag{17.30}$$

Using this notation, the estimated partial effects can be obtained by computing
derivatives or differences of $m(z_1\hat\delta_1 + \hat\alpha_1 y_2, \hat\sigma_1^2)$ with respect to elements of $(z_1, y_2)$,
just as we did in the case of exogenous explanatory variables.
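As a worked illustration of the last two steps, the sketch below forms $\hat\sigma_1^2$ from the pieces delivered by the two stages and then uses the function in (17.30) to compute the effect of a one-unit change in $y_2$. All of the numerical inputs are hypothetical placeholders, not estimates from any example in the text.

```python
import numpy as np
from scipy.stats import norm

def m(z, sigma2):
    """The function in (17.30): Phi(z/sigma) * z + sigma * phi(z/sigma)."""
    s = np.sqrt(sigma2)
    return norm.cdf(z / s) * z + s * norm.pdf(z / s)

# Hypothetical two-step estimates
theta1_hat = 0.6      # coefficient on the first-stage residual (second-stage Tobit)
tau1sq_hat = 0.64     # variance estimate from the second-stage Tobit
tau2sq_hat = 1.0      # error variance estimate from the first-stage regression
alpha1_hat = 1.0      # coefficient on y2 in the second-stage Tobit

# Recover the structural error variance: sigma1^2 = theta1^2 * tau2^2 + tau1^2
sigma1sq_hat = theta1_hat**2 * tau2sq_hat + tau1sq_hat

# Effect of raising y2 by one unit at a made-up value of z1*delta1 + alpha1*y2
index0 = 0.3
effect = m(index0 + alpha1_hat, sigma1sq_hat) - m(index0, sigma1sq_hat)
print("sigma1^2 estimate:", sigma1sq_hat)
print("effect of a one-unit change in y2:", effect)
```

Note that the variance plugged into $m(\cdot,\cdot)$ is $\hat\sigma_1^2$, not the second-stage variance $\hat\tau_1^2$, because the partial effects of interest average over the distribution of $u_1$.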

As with all control function procedures, we can easily allow more general functional
forms in both the exogenous and endogenous variables (such as squares and
interactions). In fact, if we replace $x_1 = (z_1, y_2)$ with $x_1 = g_1(z_1, y_2)$, then the
estimation procedure is unchanged. Of course, we must believe that $y_2$ has the reduced
form in (17.27) with the error term having the properties already described. Some
additional flexibility is gained by allowing $E(u_1 \mid v_2)$ to be nonlinear, for example a
quadratic function, $E(u_1 \mid v_2) = \theta_1 v_2 + \psi_1(v_2^2 - \tau_2^2)$ (where the variance of $v_2$ is
subtracted from $v_2^2$ to ensure $E(u_1) = 0$), and then the second step of the procedure adds
$\hat v_2$ and $\hat v_2^2 - \hat\tau_2^2$ as the control functions. The easiest way to obtain APEs in this case is
to use derivatives and changes with respect to elements of $x_1$ of

$$N^{-1}\sum_{i=1}^{N} m\big(x_1\hat\beta_1 + \hat\theta_1\hat v_{i2} + \hat\psi_1(\hat v_{i2}^2 - \hat\tau_2^2),\ \hat\tau_1^2\big), \tag{17.31}$$

where $m(z, \sigma^2)$ is defined in (17.30); that is, we average out the reduced form residuals,
$\hat v_{i2}$. Of course, (17.31) can be used in the case where $E(u_1 \mid v_2)$ is linear in $v_2$ and
$x_1 = (z_1, y_2)$, but it does not exploit the full distributional assumptions.

In extending the usual model by allowing $E(u_1 \mid v_2)$ to be a nonlinear function of
$v_2$, one should be aware that $u_1$ will likely not have a normal distribution. Practically,
we may be as satisfied with a conditional normal distribution for $D(u_1 \mid v_2)$, even if
that implies nonnormality of $D(u_1)$.

A full maximum likelihood approach avoids the two-step estimation problem. The
joint distribution of $(y_1, y_2)$ given $z$ is most easily found by using

$$f(y_1, y_2 \mid z) = f(y_1 \mid y_2, z) f(y_2 \mid z), \tag{17.32}$$

just as for the probit case in Section 15.7.2. The density $f(y_2 \mid z)$ is $\mathrm{Normal}(z\delta_2, \tau_2^2)$.
We already know that $D(y_1 \mid z, y_2) = \mathrm{Tobit}(x_1\beta_1 + \theta_1(y_2 - z\delta_2), \tau_1^2)$, where $\tau_1^2 = \sigma_1^2 - (\eta_1^2/\tau_2^2)$,
$\sigma_1^2 = \mathrm{Var}(u_1)$, $\tau_2^2 = \mathrm{Var}(v_2)$, and $\eta_1 = \mathrm{Cov}(v_2, u_1)$. (As in the two-step
estimation framework, we can allow $E(u_1 \mid v_2)$ to be a more flexible function of $v_2$,
subject to the caveat that we are implicitly using a nonnormal unconditional distribution
for $u_1$.) Taking the log of equation (17.32), the log-likelihood function for
each $i$ is easily constructed as a function of the parameters $(\delta_1, \alpha_1, \delta_2, \sigma_1^2, \tau_2^2, \eta_1)$. The
usual conditional maximum likelihood theory can be used for constructing standard
errors and test statistics. When the structural equation is just identified, the two-step
and MLE estimates are identical (although the two-step inference needs to be
adjusted).

Once the MLE has been obtained, we can easily test the null hypothesis of exogeneity
of $y_2$ by using the t statistic for $\hat\theta_1$. Because the MLE can be computationally
more difficult than the Smith-Blundell procedure, it makes sense to use the Smith- 
Blundell procedure to test for endogeneity before obtaining the MLE. 

If y, is a binary variable, then the Smith-Blundell assumptions cannot be expected 
to hold. Taking equation (17.26) as the structural equation, we could add 


V2 = | [za + v2 > 0) (17.33) 


and assume that (u1, v2) has a zero-mean normal distribution and is independent of z; 
v2 is standard normal, as always. Equation (17.32) can be used to obtain the log like- 
lihood for each i. Since y, given z is probit, its density is easy to obtain: f(y, |z) = 
@(zm)"?[1 — @(zmy)|'"?. The hard part is obtaining the conditional density 
(yı | 32,2), which is done first for y, = 0 and then for y, = 1; see Problem 17.6. 
Unfortunately, as in the probit case (and nonlinear models generally), the simple
strategy of replacing the binary variable $y_2$ with its fitted value from a probit
(or logit, or linear probability model) does not work; it is another example of a
forbidden regression. In particular, the distribution of $y_1$ given z does not follow a
Tobit$(z_1\delta_1 + \alpha_1\Phi(z\pi_2), \kappa^2)$ model for any variance parameter $\kappa^2$. Generally,
$D(y_1 \mid z)$ is difficult to characterize, although we could properly use it in a two-step
procedure. But better than that, we can obtain $D(y_1 \mid y_2, z)$ and use it along with a
probit specification for $D(y_2 \mid z)$ to perform joint maximum likelihood estimation.

If $y_2$ is itself a corner solution and follows a Tobit$(z\pi_2, \tau_2^2)$ model, then the MLE
can also be derived. But again, we should not replace $y_2$ with the estimated expected
value $\hat{E}(y_2 \mid z)$ obtained from a first-stage Tobit. Such a procedure does not produce
consistent estimates and may be very badly biased. 

Because properly accounting for endogenous explanatory variables that have 
nonnormal conditional distributions in Tobit models is challenging, one still sees 
linear models used when $y_1$ is a corner, and then standard IV methods, such as 2SLS,
can be applied regardless of the nature of y2. As with binary responses, the linear 
model has been much maligned for corner solution responses, but a linear model 
estimated by 2SLS can deliver good estimates of average effects. One of the inap- 
propriate two-step approaches described previously, where fitted probit or Tobit 
estimates are inserted into a second-stage Tobit, is likely inferior to a linear model 
that has been properly estimated by instrumental variables. An interesting topic is to 
find control function methods that can be used for general $y_2$ that do not suffer from
a forbidden regression problem. 

The Smith-Blundell control function method extends immediately to more than
one endogenous explanatory variable, provided we have sufficient instruments and a
vector of reduced form errors $v_2$ such that $D(u_1 \mid z, v_2) = D(u_1 \mid v_2)$, where the latter is
homoskedastic normal with linear mean. For example, we might assume
$h_g(y_{2g}) = z\pi_g + v_{2g}$ for a strictly monotonic function $h_g(\cdot)$, $g = 1, \ldots, G_1$, where $G_1$ is the
number of endogenous explanatory variables. Then $x_1 = g_1(z_1, y_2)$ is the set of
explanatory variables, and we can add the vector of reduced form residuals, obtained
from the $G_1$ regressions of $h_g(y_{2g})$ on z, to a standard Tobit model in the second stage. See
Problem 17.9 for ways to allow more flexibility in $D(u_1 \mid z, v_2)$.


17.5.3 Heteroskedasticity and Nonnormality in the Latent Variable Model 


As in the case of probit, both heteroskedasticity and nonnormality result in the Tobit
estimator $\hat{\beta}$ being inconsistent for $\beta$. This inconsistency occurs because the derived
density of y given x hinges crucially on $y^* \mid x \sim \mathrm{Normal}(x\beta, \sigma^2)$.

Rather than focusing on parameters, we must remember that the presence of
heteroskedasticity or nonnormality in the latent variable model entirely changes the
functional forms for E(y | x, y > 0) and E(y | x). Therefore, it does not make sense to
focus only on the inconsistency in estimating $\beta$. We should study how departures
from the homoskedastic normal assumption affect the estimated partial derivatives of 
the conditional mean functions. Allowing for heteroskedasticity or nonnormality in 
the latent variable model can be useful for generalizing functional form in corner 
solution applications, and it should be viewed in that light. 

Specification tests can be based on the score approach, where the standard Tobit
model is nested in a more general alternative. Tests for heteroskedasticity and
nonnormality in the latent variable equation are easily constructed if the outer product
form of the statistic (see Section 13.6) is used. A useful test for heteroskedasticity is
obtained by assuming $\mathrm{Var}(u \mid x) = \sigma^2 \exp(x_1\delta)$, where $x_1$ is a $1 \times Q$ subvector of x
($x_1$ does not include a constant). The Q restrictions $H_0: \delta = 0$ can be tested using the
LM statistic. The partial derivatives of the log likelihood $\ell_i(\beta, \sigma^2, \delta)$ with respect to $\beta$
and $\sigma^2$, evaluated at $\delta = 0$, are given exactly as in equations (17.21) and (17.22).
Further, we can show that $\partial \ell_i/\partial \delta = \sigma^2 x_{i1}(\partial \ell_i/\partial \sigma^2)$ at $\delta = 0$. Thus the outer product
form of the score statistic is $N - SSR_0$ from the regression


1 on $\partial \hat{\ell}_i/\partial \beta$, $\partial \hat{\ell}_i/\partial \sigma^2$, $\hat{\sigma}^2 x_{i1}(\partial \hat{\ell}_i/\partial \sigma^2)$, $i = 1, \ldots, N$,


where the derivatives are evaluated at the Tobit estimates (the restricted estimates)
and $SSR_0$ is the usual sum of squared residuals. Under $H_0$, $N - SSR_0 \overset{a}{\sim} \chi^2_Q$.
Unfortunately, as we discussed in Section 13.6, the outer product form of the statistic 
can reject much too often when the null hypothesis is true. If MLE of the alternative 
model is possible, the LR statistic is a preferable alternative. 

We can also construct tests of nonnormality that require only standard Tobit esti- 
mation. The most convenient of these are derived as conditional moment tests, which 
we discussed in Section 13.7. See Pagan and Vella (1989). 

It is not too difficult to estimate Tobit models with u heteroskedastic if a test 
reveals such a problem. When E(y |x, y > 0) and E(y |x) are of interest, we should 
look at estimates of these expectations with and without heteroskedasticity. The 
partial effects on E(y | x, y > 0) and E(y | x) could be similar even though the
estimates of $\beta$ might be very different.

As with the probit model with heteroskedasticity, there is a subtle issue that arises 
when computing APEs when we introduce heteroskedasticity into the type I Tobit 
model. Suppose we replace (17.3) with 


$u \mid x \sim \mathrm{Normal}(0, \sigma^2 \exp(x_1\delta))$.


Then, following the same argument in Section 15.7.4, it can be shown that the
average structural function is


$\mathrm{ASF}(x) = E_{x_{i1}}[m(x\beta, \sigma^2 \exp(x_{i1}\delta))]$,


where $m(\cdot, \cdot)$ is defined in equation (17.30). This means, for example, that the partial
effect of a continuous variable (evaluated at x) is simply estimated as


$\left[ N^{-1} \sum_{i=1}^{N} \Phi\big( x\hat{\beta} / [\hat{\sigma} \exp(x_{i1}\hat{\delta}/2)] \big) \right] \hat{\beta}_j$,


where all estimates are the MLEs from the heteroskedastic Tobit model. Unfortu- 
nately, as we discussed in Section 15.7.4, if the heteroskedasticity arises due to inter- 
actions between unobservables that are independent of x and elements of x, then the 
APEs are estimated from the partial derivatives of E(y | x), which are more
complicated and may have signs that differ from the signs of the $\beta_j$.
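
The averaged scale factor in the estimated partial effect can be computed directly once the heteroskedastic Tobit MLEs are available. The following is a minimal sketch, not a definitive implementation: the function and argument names (`hetero_tobit_partial_effect`, `x1_sample`, and so on) are hypothetical, and the estimates are assumed to be supplied by the user from some separate MLE routine.

```python
from math import exp
from statistics import NormalDist

STD_NORMAL = NormalDist()  # standard normal distribution


def dot(a, b):
    """Inner product of two equal-length sequences."""
    return sum(ai * bi for ai, bi in zip(a, b))


def hetero_tobit_partial_effect(x, x1_sample, beta_hat, sigma_hat, delta_hat, j):
    """Average Phi(x*beta / [sigma * exp(x_i1*delta / 2)]) over i, times beta_j.

    x         : covariate values at which the effect is evaluated
    x1_sample : list of observed x_i1 vectors (the variance covariates)
    """
    xb = dot(x, beta_hat)
    scale = sum(
        STD_NORMAL.cdf(xb / (sigma_hat * exp(dot(xi1, delta_hat) / 2.0)))
        for xi1 in x1_sample
    ) / len(x1_sample)
    return scale * beta_hat[j]
```

When `delta_hat` is a vector of zeros, the function returns the usual homoskedastic Tobit scale factor times the coefficient, which is a convenient sanity check.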

As a rough idea of the appropriateness of the standard Tobit model, we can compare
the probit estimates, say $\hat{\gamma}$, to the Tobit estimate of $\gamma = \beta/\sigma$, namely, $\hat{\beta}/\hat{\sigma}$. These
will never be identical, but they should not be statistically different. Statistically
significant sign changes are indications of misspecification. For example, if $\hat{\gamma}_j$ is positive
and significant but $\hat{\beta}_j$ is negative and perhaps significant, the Tobit model is probably
misspecified.

As an illustration, in Example 15.2, we obtained the probit coefficient on nwifeinc
as -.012, and the coefficient on kidslt6 was -.868. When we divide the corresponding
Tobit coefficients by $\hat{\sigma}$ = 1,122.02, we obtain about -.0079 and -.797, respectively.
Though the estimates differ somewhat, the signs are the same and the magnitudes are 
similar. 

It is possible to form a Hausman statistic as a quadratic form in $(\hat{\gamma} - \hat{\beta}/\hat{\sigma})$, but
obtaining the appropriate asymptotic variance is somewhat complicated. (See Ruud, 
1984, for a formal discussion of this test.) Section 17.6 discusses more flexible models 
that may be needed for corner solution outcomes. 


17.5.4 Estimating Parameters under Weaker Assumptions 


It is possible to $\sqrt{N}$-consistently estimate $\beta$ without assuming a particular
distribution for u and without even assuming that u and x are independent. Consider again
the latent variable model, but where the median of u given x is zero:


$y^* = x\beta + u, \quad \mathrm{Med}(u \mid x) = 0$. (17.34)


As we showed in Section 17.2, assumption (17.34) leads to


$\mathrm{Med}(y \mid x) = \max(0, x\beta)$. (17.35)


In Chapter 12 we showed how the analogy principle leads to least absolute deviations
as the appropriate method for estimating the parameters in a conditional median.
Therefore, assumption (17.34) suggests estimating $\beta$ by solving


$\min_{\beta} \sum_{i=1}^{N} |y_i - \max(0, x_i\beta)|$. (17.36)


This estimator was suggested by Powell (1984). Since $q(w, \beta) = |y - \max(0, x\beta)|$ is a
continuous function of $\beta$, consistency of Powell’s estimator follows from Theorem
12.2 under an appropriate identification assumption. Establishing $\sqrt{N}$-asymptotic
normality is much more difficult because the objective function is not twice con- 
tinuously differentiable with nonsingular Hessian. Powell (1984, 1994) and Newey 
and McFadden (1994) contain applicable theorems. 
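
To make the objective in (17.36) concrete, here is a minimal sketch that evaluates it and picks the best of a few candidate parameter vectors. The helper names and the tiny data set are hypothetical, and the grid search is purely illustrative: because the objective is neither smooth nor convex, practical implementations rely on specialized algorithms rather than a grid.

```python
def dot(a, b):
    """Inner product of two equal-length sequences."""
    return sum(ai * bi for ai, bi in zip(a, b))


def clad_objective(beta, X, y):
    """Sum over i of |y_i - max(0, x_i*beta)|, the objective in (17.36)."""
    return sum(abs(yi - max(0.0, dot(xi, beta))) for xi, yi in zip(X, y))


def clad_grid_search(candidates, X, y):
    """Return the candidate beta with the smallest objective (illustration only)."""
    return min(candidates, key=lambda b: clad_objective(b, X, y))


# Hypothetical data generated without error from y = max(0, 2*x).
X = [[1.0], [2.0], [-1.0]]
y = [2.0, 4.0, 0.0]
best = clad_grid_search([[0.0], [1.0], [2.0], [3.0]], X, y)
```

Note that the third observation, with a negative index, contributes nothing to the objective at any nonnegative candidate, which reflects how observations with $x_i\beta \le 0$ carry no information about the slope of the median function.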

An attractive feature of Powell’s approach is that Med(y|x) can be estimated 
without specifying D(u|x) beyond its having a zero conditional median. Of course, 
under (17.34) we cannot estimate other features of D(y |x): if we impose weak 
assumptions, then often we can learn only about limited features of a distribution. 
Because y is a corner solution, it is unclear how valuable estimating the conditional 
median is. For $x\beta > 0$, Med(y | x) is linear in x, and so the $\beta_j$ are the partial effects on
the median in that region. One interesting aspect of equation (17.35) is that if we use the
median for prediction, we predict $\hat{y}_i = 0$ for all i such that $x_i\hat{\beta} \le 0$. This will not
happen if we use the conditional mean, E(y| x), to predict y, as is easily seen for the 
type I Tobit. 
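
The contrast between median and mean prediction is easy to see numerically. The sketch below assumes the type I Tobit mean function $E(y \mid x) = \Phi(x\beta/\sigma)x\beta + \sigma\phi(x\beta/\sigma)$ and the median in (17.35); the function names are hypothetical.

```python
from statistics import NormalDist

STD_NORMAL = NormalDist()  # standard normal distribution


def tobit_median(xb):
    """Med(y|x) = max(0, x*beta), as in (17.35)."""
    return max(0.0, xb)


def tobit_mean(xb, sigma):
    """E(y|x) = Phi(xb/sigma)*xb + sigma*phi(xb/sigma) for the type I Tobit."""
    z = xb / sigma
    return STD_NORMAL.cdf(z) * xb + sigma * STD_NORMAL.pdf(z)
```

For a negative index such as `xb = -0.5` with `sigma = 1.0`, `tobit_median` returns zero while `tobit_mean` is strictly positive.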

As in the probit case, one consequence of having consistent estimates of the $\beta_j$ that
do not rely on full distributional assumptions is that we can estimate the directions of 
the APEs on the mean response, and also the relative partial effects for the continu- 
ous explanatory variables, under weaker assumptions. But we cannot estimate the 
magnitude of the partial effects on the means, and there is no easy way to obtain 
relative effects for discrete explanatory variables. (Because the partial effects of dis- 
crete variables cannot be obtained via calculus, relative effects involving a discrete 
explanatory variable generally depend on x.) 

The parametric nature of the Tobit model—that is, it fully specifies D(y | x)—is 
often stated as its major weakness. But for modeling corner solution outcomes, we do 
not get something for nothing. The Tobit model implies that we can estimate any 
feature of D(y | x) that we want, including P(y > 0 | x), E(y | x, y > 0), E(y | x),
and, of course, Med(y | x) (which is necessarily given by (17.35)). Therefore, it is
difficult to rank Powell’s semiparametric method and type I Tobit: Powell’s method
assumes little and delivers estimates of only a single feature of D(y | x), the
conditional median, while Tobit assumes a lot but delivers the entire conditional
distribution. If the type I Tobit model is true, then the MLEs of the $\beta_j$ should not
differ significantly from the censored least absolute deviations (CLAD) estimates, and 
so CLAD can be used as a rough specification check. In the next section we address 
the idea that perhaps the type I Tobit model is not flexible enough for a wide range of 
applications. 

Interestingly, if we modify the model to allow multiplicative heterogeneity of
the form in Section 17.5.1, $y = q \cdot \max(0, x\beta + u)$, where q is independent of (x, u),
then we cannot determine Med(y | x), and, generally, CLAD estimates nothing of
interest, even if D(u | x) is homoskedastic normal. Yet, as we showed in Section
17.5.1, E(y | x) has the usual Tobit form, and we could consistently estimate the
parameters by NLS. Again, by focusing on the features of D(y | x) that are identified
by different approaches, as opposed to parameters, we find that the choice between
seemingly less parametric methods such as CLAD, and an uncommon method such
as NLS applied to the Tobit functional form $E(y \mid x) = \Phi(x\beta/\sigma)x\beta + \sigma\phi(x\beta/\sigma)$, is
not as clear-cut as is often presented. The bottom line is that CLAD consistently
estimates Med(y | x) if $\mathrm{Med}(y \mid x) = \max(0, x\beta)$, while NLS consistently estimates
E(y | x) if $E(y \mid x) = \Phi(x\beta/\sigma)x\beta + \sigma\phi(x\beta/\sigma)$.

In some cases a quantile other than the median is of interest, and Powell’s 
approach applies when $\mathrm{Quant}_\tau(u \mid x) = 0$ for a quantile $\tau$, provided the absolute value
function is replaced by the asymmetric loss (check) function; see Section 12.10. 
Buchinsky and Hahn (1998) offer a different approach to estimating quantiles. 

The Chung and Goldberger (1984) results on consistent OLS estimation of slope 
coefficients up to a common scale factor apply when y is given by (17.2) and u and 
x are uncorrelated; see also Section 15.7.5. But the assumptions are restrictive, and 
the OLS estimates at best deliver directions of effects and relative partial effects for 
the continuous covariates. The Stoker (1986) result also applies: if x is multivariate 
normal, then the linear regression y on x consistently estimates the partial effects 
averaged across the distribution of x. As in the binary response case, multivariate 
normality of x makes Stoker’s result of mostly theoretical interest. In Example 17.2 
we saw that there is strong evidence that at least one OLS estimate, even on a roughly 
continuous variable (educ), gave a very different estimate of the APE from the Tobit 
model. Of course, this does not mean that the Tobit model estimate is closer to the 
population APE, but it does suggest that linear models have limitations for corner 
solution responses. And Stoker’s result cannot be applied to discrete explanatory 
variables. 


17.6 Two-Part Models and Type II Tobit for Corner Solutions


As we saw in Section 17.2, the type I Tobit model implies that the partial effects of an 
explanatory variable on P(y > 0|x) and E(y |x, y > 0) must have the same signs. It 
is easy to imagine situations where this implication of the standard Tobit model 
could be false. For example, if y is amount of life insurance and age is an explanatory 
variable, age could have a positive effect, or at least an initially positive effect, on the 
probability of having a life insurance policy. But after a certain age, the amount of 
life insurance coverage might decline. Such a situation would violate the type I Tobit 
model assumptions. 

We also showed that the standard Tobit model implies that the relative effects 
of two continuous explanatory variables, say $x_j$ and $x_h$, on P(y > 0 | x) and
E(y | x, y > 0) are identical (and equal to $\beta_j/\beta_h$). For example, in a labor supply
model where education and experience appear only in level form, if a year of educa- 
tion has twice the effect as a year of experience on the probability of labor force 
participation, then education necessarily has twice the effect on the expected hours 
worked for the subpopulation of those working. Even if we think that the partial 
effects of a variable on P(y > 0|x) and E(y|x, y > 0) have the same sign, we might 
not wish to impose the restriction that any two (continuous) explanatory variables 
have the same relative effects on these two different features of D(y |x). 

In this section, we consider models that are more flexible than the type I Tobit 
model. These models allow separate mechanisms to determine what we call the par- 
ticipation decision (y = 0 versus y > 0) and the amount decision (the magnitude of y 
when it is positive). As we will see, such models are fairly easy to estimate. Unfortu- 
nately, there is some confusion in the literature about the nature and interpretation of 
two approaches to extending the type I Tobit model. Fortunately, we can resolve 
some of the ambiguity in the literature by using a simple, unified setting. 

Let s be a binary variable that determines whether y is zero or strictly positive. It is 
also useful to introduce a continuously distributed, nonnegative latent variable, 
which we call w* in this section. Then we assume y is generated as 


$y = s \cdot w^*$. (17.37)


It is important to remember that y—for example, annual hours worked—is the 
observed corner solution response, and it is features of D(y|x) that we would like to 
explain. Then, a model like (17.37) can arise if there are fixed costs that affect the 
decision to enter a particular state. For a married woman out of the labor force, the 
decision to enter the labor force may depend on a variety of considerations, including 
whether she has small children. The way that the presence of a small child affects the 
labor force participation decision may be quite different from how it affects the
decision on how much to work. Equation (17.37) is a convenient way to allow dif- 
ferent mechanisms for the participation and amount decisions. Other than s being 
binary and w* being continuous, there is another important difference between s and 
w*: we effectively observe s because s is observationally equivalent to the indicator 
1[y > 0] (because we assume P(w* = 0) = 0). But w* is only observed when s = 1, in
which case w* = y. 

To proceed in a parametric setting, we will assume that s and w* are specific 
functions of observable covariates and unobservables, and we will make (at least 
partial) distributional assumptions about the unobservables. But we can discuss, in a 
general way, different assumptions about how s and w* are related. As we will see, a 
useful assumption is that s and w* are independent conditional on explanatory vari- 
ables x, which we can write as 


D(w* | s,x) = D(w* |x). (17.38) 


When assumption (17.38) holds, the resulting model has typically been called a two- 
part model or hurdle model. The assumption is basically that, conditional on a set of 
observed covariates, the mechanisms determining s and w* are independent. One 
implication of (17.38) is that the expected value of y conditional on x and s is easy to 
obtain: 


$E(y \mid x, s) = s \cdot E(w^* \mid x, s) = s \cdot E(w^* \mid x)$, (17.39)
which, of course, can be derived under the conditional mean version of (17.38), 
E(w* | x,s) = E(w* |x). (17.40) 
When s = 1, (17.39) becomes 

E(y|x, y > 0) = E(w* |x), (17.41) 


so that the so-called conditional expectation of y (where we condition on y > 0) is 
just the expected value of w* (conditional on x). Further, the so-called unconditional 
expectation is 


E(y|x) = E(s|x)E(w* |x) = P(s = 1 | x)E(w* |x). (17.42) 


Although some, for example, Duan, Manning, Morris, and Newhouse (1984), have 
argued that two-part models do not impose (17.38), some sort of conditional inde- 
pendence is natural, even if it is only (17.40). This will become clear as our analysis 
unfolds. 
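
For completeness, the step from (17.39) to (17.42) can be written out with the law of iterated expectations. The derivation below is a sketch that uses only the conditional mean independence assumption (17.40) and the fact that s is binary.

```latex
\begin{aligned}
E(y \mid x) &= E[\,E(y \mid x, s) \mid x\,]
    && \text{(iterated expectations)} \\
 &= E[\,s \cdot E(w^* \mid x) \mid x\,]
    && \text{(by (17.39), which uses (17.40))} \\
 &= E(s \mid x)\, E(w^* \mid x)
    && \text{(} E(w^* \mid x) \text{ is a function of } x \text{ only)} \\
 &= P(s = 1 \mid x)\, E(w^* \mid x)
    && \text{(} s \text{ is binary).}
\end{aligned}
```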

A different class of models explicitly allows correlation between the participation
and amount decisions (after conditioning on covariates). Unfortunately, such a 
model is often called a selection model. When attached to corner solution responses, 
the “selection model” label has shortcomings and has led to considerable confusion. 
As in other situations—such as treating a corner solution response as a data- 
censoring problem—the confusion arises because of statistical similarities that mod- 
els for corner solutions have with missing data models. But, remember, we have no 
missing data problem here. We are interested in explaining the corner solution re- 
sponse, y, and we assume it is always observable. We may use latent variable models 
to obtain D(y |x), but in the end, the latent variables are irrelevant. Because of its 
common use, we will use the selection model label. In Section 17.6.3 we study a
version of the type II Tobit model.

There has been much discussion in the literature on whether two-part and selection 
models can be put into a common framework, and whether selection models nest 
two-part models; see, for example, the survey and discussion in Leung and Yu (1996). 
Using (17.37) as a unified setting, we will see that, technically speaking, the type II 
Tobit model (applied to the logarithm of the response) does nest what is probably the 
most widely used two-part model, the lognormal hurdle model. But the type II Tobit 
model can be poorly identified without assuming that an exclusion restriction exists. 
Namely, we will often need to assume that there is at least one element of x that 
appears in P(s = 1|x) that does not appear in D(w* |x). Therefore, in a practical 
sense, the models offer two different approaches. We will have more to say on this 
issue when we cover the specific models. 


17.6.1 Truncated Normal Hurdle Model 


Cragg (1971) proposed a natural two-part extension of the type I Tobit model. The 
conditional independence assumption (17.38) is assumed to hold, and the binary 
variable s is assumed to follow a probit model, that is, 


$P(s = 1 \mid x) = \Phi(x\gamma)$. (17.43)


The unique feature of Cragg’s model is that the latent variable w* is assumed to have 
a truncated normal distribution with parameters that can vary freely from those in 
(17.43). The support of w* is $(0, \infty)$, and so there is no possibility that the model
predicts negative outcomes on y. We can specify the model in terms of (17.37) by
defining $w^* = x\beta + u$, where u given x has a truncated normal distribution with lower
truncation point $-x\beta$. Because y = w* when y > 0, we can write the truncated
normal assumption in terms of the density of y given y > 0 (and x):


$f(y \mid x, y > 0) = [\Phi(x\beta/\sigma)]^{-1} \phi[(y - x\beta)/\sigma]/\sigma, \quad y > 0$, (17.44)


where the term $[\Phi(x\beta/\sigma)]^{-1}$ ensures that the density integrates to unity over y > 0.
The density of y given x can be written succinctly as


$f(y \mid x) = [1 - \Phi(x\gamma)]^{1[y = 0]} \{\Phi(x\gamma) [\Phi(x\beta/\sigma)]^{-1} \phi[(y - x\beta)/\sigma]/\sigma\}^{1[y > 0]}$, (17.45)


where we must multiply f(y | x, y > 0) by $P(y > 0 \mid x) = \Phi(x\gamma)$. Equation (17.45),
which is how Cragg directly specified the model without introducing s and w*, makes
it clear that the truncated normal hurdle (TNH) model reduces to the type I Tobit
model when $\gamma = \beta/\sigma$.

As usual, the log-likelihood function for a random draw i is obtained by plugging
$(x_i, y_i)$ into (17.45) and taking the log, so we have


$\ell_i(\theta) = 1[y_i = 0] \log[1 - \Phi(x_i\gamma)] + 1[y_i > 0] \log[\Phi(x_i\gamma)]$
$\quad + 1[y_i > 0]\{-\log[\Phi(x_i\beta/\sigma)] + \log\{\phi[(y_i - x_i\beta)/\sigma]\} - \log(\sigma)\}$.


Because the parameters $\gamma$, $\beta$, and $\sigma$ are allowed to freely vary, it is easily seen that the
MLE for $\gamma$, $\hat{\gamma}$, is simply the probit estimator from probit of $s_i = 1[y_i > 0]$ on $x_i$. The
MLEs of $\beta$ and $\sigma$ (or $\beta$ and $\sigma^2$) are also fairly easy to obtain using software that
estimates truncated normal regression models. (We return to truncated normal
regression in Chapter 19, but in the context of missing data. Here, the truncated normal
distribution is convenient for estimating the density of y given x over strictly positive
values.) Inference about the parameters is straightforward using Wald tests. Lin and
Schmidt (1984) derive the LM test of the restrictions $\gamma = \beta/\sigma$; naturally, this test only
requires estimation of the type I Tobit model. If one estimates Cragg’s model, then
the LR statistic is easy to compute. One must be sure to add the two parts of the
log-likelihood function for the hurdle model: that from the probit using all
observations and that from the truncated normal using the $y_i > 0$ observations. Of
course, one may reject the standard type I Tobit model for other reasons, such
as heteroskedasticity or nonnormality in the latent variable model. And, as always,
a statistical rejection might not lead to practically different estimates of the partial
effects.
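
The nesting of the type I Tobit model in Cragg’s model can be checked numerically. The following sketch codes the two per-observation log likelihoods in terms of the indexes $x_i\gamma$ and $x_i\beta$; the function names and the argument names `xg` and `xb` are hypothetical shorthand, and no estimation is attempted.

```python
from math import log
from statistics import NormalDist

STD_NORMAL = NormalDist()  # standard normal distribution


def tnh_loglik_i(y, xg, xb, sigma):
    """Truncated normal hurdle log likelihood for one draw.

    xg is the probit index x_i*gamma; xb is the index x_i*beta.
    """
    if y == 0:
        return log(1.0 - STD_NORMAL.cdf(xg))
    return (
        log(STD_NORMAL.cdf(xg))
        - log(STD_NORMAL.cdf(xb / sigma))
        + log(STD_NORMAL.pdf((y - xb) / sigma))
        - log(sigma)
    )


def tobit_loglik_i(y, xb, sigma):
    """Type I Tobit log likelihood for one draw."""
    if y == 0:
        return log(1.0 - STD_NORMAL.cdf(xb / sigma))
    return log(STD_NORMAL.pdf((y - xb) / sigma)) - log(sigma)
```

Setting `xg = xb / sigma` in `tnh_loglik_i` reproduces `tobit_loglik_i` for both zero and positive outcomes, which is the restriction $\gamma = \beta/\sigma$ at the level of a single observation.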

As an illustration, we estimate the truncated normal hurdle model for the data 
in Table 17.1. (The full results are reported in Section 17.6.3.) The log likelihood 
value is —3,791.95, compared with —3,819.09 for the standard (type I) Tobit model. 
Therefore, the LR statistic is about 54.28. With eight degrees of freedom, the p-value 
is zero to many decimal places. Therefore, the Tobit model is rejected against Cragg’s 
more general model. 
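
The arithmetic behind this test is simple to reproduce. The sketch below computes the LR statistic from the two reported log-likelihood values and, because the degrees of freedom are even, uses the closed-form upper tail of the chi-square distribution; the function names are hypothetical.

```python
from math import exp, factorial


def lr_statistic(loglik_unrestricted, loglik_restricted):
    """Likelihood ratio statistic: twice the difference in log likelihoods."""
    return 2.0 * (loglik_unrestricted - loglik_restricted)


def chi2_sf_even_df(x, df):
    """Upper-tail probability P(chi2_df > x), valid for even df only.

    Uses exp(-x/2) * sum_{k=0}^{df/2 - 1} (x/2)^k / k!.
    """
    if df <= 0 or df % 2 != 0:
        raise ValueError("closed form requires a positive even df")
    half = x / 2.0
    return exp(-half) * sum(half**k / factorial(k) for k in range(df // 2))


lr = lr_statistic(-3791.95, -3819.09)  # hurdle model versus type I Tobit
p_value = chi2_sf_even_df(lr, 8)       # eight restrictions
```

The resulting p-value is far below any conventional significance level, consistent with the statement that it is zero to many decimal places.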

The expected values for the truncated normal hurdle model are straightforward
extensions of the standard Tobit model. First, the distribution D(y | x, y > 0) is
identical in the two models, and so


$E(y \mid x, y > 0) = x\beta + \sigma\lambda(x\beta/\sigma)$. (17.46)


The difference is that P(y > 0 | x) is allowed to follow an unrestricted probit model.
Therefore, for Cragg’s model,


$E(y \mid x) = \Phi(x\gamma)[x\beta + \sigma\lambda(x\beta/\sigma)]$. (17.47)


The partial effects no longer have the simple form in (17.16), but they can be
computed easily from (17.15). In particular,


$\partial E(y \mid x)/\partial x_j = \gamma_j\phi(x\gamma)[x\beta + \sigma\lambda(x\beta/\sigma)] + \Phi(x\gamma)\beta_j\theta(x\beta/\sigma)$, (17.48)


where $\theta(z) = 1 - \lambda(z)[z + \lambda(z)]$, as in (17.12). Further, semielasticities on the
conditional mean are easily obtained from $\log[E(y \mid x)] = \log[\Phi(x\gamma)] + \log[E(y \mid x, y > 0)]$,
which implies $\partial \log[E(y \mid x)]/\partial x_j = \partial \log[\Phi(x\gamma)]/\partial x_j + \partial \log[E(y \mid x, y > 0)]/\partial x_j$. Thus,
the semielasticity of E(y | x) with respect to $x_j$ is obtained by multiplying


$\gamma_j\lambda(x\gamma) + \beta_j\theta(x\beta/\sigma)/[x\beta + \sigma\lambda(x\beta/\sigma)]$ (17.49)


by 100. If $x_j = \log(z_j)$, then (17.49) is the elasticity of E(y | x) with respect to $z_j$. We
can insert the MLEs into any of the equations and average across $x_i$ to obtain an
APE, average semielasticity, or average elasticity. As in many nonlinear contexts,
the bootstrap is a convenient method for obtaining valid standard errors.
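
The mean function (17.47) and the semielasticity (17.49) are straightforward to code. The sketch below takes $\theta(z) = 1 - \lambda(z)[z + \lambda(z)]$, which is the derivative of $z + \lambda(z)$; all function names are hypothetical, and the parameter values would in practice be the MLEs.

```python
from statistics import NormalDist

STD_NORMAL = NormalDist()  # standard normal distribution


def dot(a, b):
    """Inner product of two equal-length sequences."""
    return sum(ai * bi for ai, bi in zip(a, b))


def mills(z):
    """Inverse Mills ratio lambda(z) = phi(z) / Phi(z)."""
    return STD_NORMAL.pdf(z) / STD_NORMAL.cdf(z)


def theta(z):
    """theta(z) = 1 - lambda(z) * [z + lambda(z)]."""
    lam = mills(z)
    return 1.0 - lam * (z + lam)


def tnh_mean(x, gamma, beta, sigma):
    """E(y|x) = Phi(x*gamma) * [x*beta + sigma*lambda(x*beta/sigma)], eq. (17.47)."""
    xg, xb = dot(x, gamma), dot(x, beta)
    return STD_NORMAL.cdf(xg) * (xb + sigma * mills(xb / sigma))


def tnh_semielasticity(x, gamma, beta, sigma, j):
    """Expression (17.49); multiply by 100 for a percentage effect."""
    xg, xb = dot(x, gamma), dot(x, beta)
    cond_mean = xb + sigma * mills(xb / sigma)
    return gamma[j] * mills(xg) + beta[j] * theta(xb / sigma) / cond_mean
```

Averaging `tnh_semielasticity` over the sample values of x gives an average semielasticity; a numerical derivative of `tnh_mean` divided by the mean itself provides an easy check on the analytical formula.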

Because we can estimate E(y | x), we can compute the squared correlation between
$y_i$ and $\hat{E}(y_i \mid x_i) = \Phi(x_i\hat{\gamma})[x_i\hat{\beta} + \hat{\sigma}\lambda(x_i\hat{\beta}/\hat{\sigma})]$ across all i as an R-squared measure. This
goodness-of-fit statistic can be compared to the usual Tobit model or the models that 
we cover subsequently. We can do a similar calculation conditional on y > 0 using 
equation (17.46). 


17.6.2 Lognormal Hurdle Model and Exponential Conditional Mean 


Cragg (1971) suggested that a lognormal distribution can be used in place of the 
truncated Tobit, and the resulting hurdle model has been studied in detail by Duan, 
Manning, Morris, and Newhouse (1984). The participation decision is still governed 
by a probit model. One way to express y is 


$y = s \cdot w^* = 1[x\gamma + v > 0] \exp(x\beta + u)$, (17.50)


where (u, v) is independent of x with a bivariate normal distribution and u and v are
independent. We have already assumed v has a standard normal distribution. There- 
fore, the assumption that makes (17.50) different from the truncated normal hurdle 
model is 


$u \mid x \sim \mathrm{Normal}(0, \sigma^2)$. (17.51)


Assumption (17.51) means that the latent variable $w^* = \exp(x\beta + u)$ has a lognormal
distribution, and, because v and u are independent of x and each other, y conditional
on (x, y > 0) has a lognormal distribution. Therefore, we call this model the
lognormal hurdle (LH) model. The expected value conditional on y > 0 is


$E(y \mid x, y > 0) = E(w^* \mid x, s = 1) = E(w^* \mid x) = \exp(x\beta + \sigma^2/2)$, (17.52)

and so the “unconditional” expectation is

$E(y \mid x) = \Phi(x\gamma)\exp(x\beta + \sigma^2/2)$. (17.53)


The semielasticity of E(y | x) with respect to $x_j$ is simply (100 times) $\gamma_j\lambda(x\gamma) + \beta_j$,
where $\lambda(\cdot)$ is the inverse Mills ratio. If $x_j = \log(z_j)$, this expression becomes the
elasticity of E(y | x) with respect to $z_j$.

Estimation of the parameters is particularly straightforward. The density
conditional on x is


$f(y \mid x) = [1 - \Phi(x\gamma)]^{1[y = 0]} \{\Phi(x\gamma)\phi[(\log(y) - x\beta)/\sigma]/(\sigma y)\}^{1[y > 0]}$, (17.54)

which leads to the log-likelihood function for a random draw:

$\ell_i(\theta) = 1[y_i = 0] \log[1 - \Phi(x_i\gamma)] + 1[y_i > 0] \log[\Phi(x_i\gamma)]$
$\quad + 1[y_i > 0]\{\log(\phi[(\log(y_i) - x_i\beta)/\sigma]) - \log(\sigma) - \log(y_i)\}$. (17.55)


As with the truncated normal hurdle model, estimation of the parameters can
proceed in two steps. The first is probit of $s_i$ on $x_i$ to estimate $\gamma$, and then $\beta$ is estimated
using an OLS regression of $\log(y_i)$ on $x_i$ for observations with $y_i > 0$. The usual
error variance estimator (with or without the degrees-of-freedom adjustment), $\hat{\sigma}^2$, is
consistent for $\sigma^2$. The last term in (17.55), $\log(y_i)$, does not affect estimation of the
parameters, but it must be included in comparing log-likelihood values across
different models for D(y | x). In particular, in order to compare Cragg’s truncated normal
hurdle model and the lognormal hurdle model, the log likelihood for each i in the
lognormal hurdle model must be obtained as in (17.55). (Strictly speaking, to compare
log-likelihood values, one should use the MLE for $\sigma$, which does not use a
degrees-of-freedom correction. The difference should be minimal unless N is small.)
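
The role of the $\log(y_i)$ term in (17.55) can be made explicit in code. The sketch below evaluates the lognormal hurdle log likelihood for one observation, with a hypothetical flag `include_log_y` that drops the term; dropping it leaves the maximizer unchanged but shifts the log-likelihood value, which matters when comparing across models.

```python
from math import log
from statistics import NormalDist

STD_NORMAL = NormalDist()  # standard normal distribution


def lh_loglik_i(y, xg, xb, sigma, include_log_y=True):
    """Lognormal hurdle log likelihood for one draw, as in (17.55).

    xg is the probit index x_i*gamma; xb is the index x_i*beta.
    """
    if y == 0:
        return log(1.0 - STD_NORMAL.cdf(xg))
    value = (
        log(STD_NORMAL.cdf(xg))
        + log(STD_NORMAL.pdf((log(y) - xb) / sigma))
        - log(sigma)
    )
    # The Jacobian term -log(y) is needed for comparisons with other models of D(y|x).
    return value - log(y) if include_log_y else value
```

Summing `lh_loglik_i` over all observations with `include_log_y=True` gives a value that is comparable to the truncated normal hurdle log likelihood computed on the same data.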

The lognormal hurdle model is easy to estimate, the parameters are easy to
interpret, and partial effects and elasticities on P(y > 0 | x), E(y | x, y > 0), and E(y | x)
are easy to obtain. Nevertheless, if we are mainly interested in these three features of
D(y | x), we can get by with weaker assumptions. Mullahy (1998) pointed out that
the linear regression with $\log(y_i)$ as the dependent variable may estimate
E[log(y) | x, y > 0] consistently, but we may not be uncovering E(y | x, y > 0) if (17.51) fails. If
we assume that u is independent of x but do not specify a distribution, then we can
estimate E(y | x, y > 0) using Duan’s (1983) smearing estimate. In the exponential
case, we obtain the scale factor, say $\hat{\tau}$, by averaging $\exp(\hat{u}_i)$ over all i with $y_i > 0$,
where the $\hat{u}_i$ are the OLS residuals from $\log(y_i)$ on $x_i$ using the $y_i > 0$ data. Then,
$\hat{E}(y \mid x, y > 0) = \hat{\tau}\exp(x\hat{\beta})$, where $\hat{\beta}$ is the OLS estimator from the regression of
$\log(y_i)$ on $x_i$ using the $y_i > 0$ subsample.
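
The smearing calculation itself is only a few lines. The sketch below assumes the OLS residuals from the log regression on the positive subsample have already been obtained elsewhere; the function names are hypothetical.

```python
from math import exp


def smearing_factor(residuals):
    """Average of exp(u_hat_i) over the y_i > 0 observations."""
    return sum(exp(u) for u in residuals) / len(residuals)


def smeared_conditional_mean(xb_hat, residuals):
    """Estimate of E(y|x, y > 0) as smearing factor times exp(x*beta_hat)."""
    return smearing_factor(residuals) * exp(xb_hat)
```

Because OLS residuals average to zero when the regression includes an intercept, Jensen’s inequality implies the smearing factor is at least one, so simply exponentiating the fitted log value understates the conditional mean.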

An alternative way to relax (17.51) is to maintain normality but allow
heteroskedasticity, say, $\mathrm{Var}(u \mid x) = \exp(x\delta)$. Then $\hat{E}(y \mid x, y > 0) = \exp[x\hat{\beta} + \exp(x\hat{\delta})/2]$
where, say, $\hat{\beta}$ and $\hat{\delta}$ are the MLEs based on $\log(y) \mid x, y > 0 \sim \mathrm{Normal}(x\beta, \exp(x\delta))$.

A more direct approach that avoids specific distributional assumptions in the sec- 
ond tier is just to model E(y|x, y > 0) directly. It is natural to use an exponential 
function, 


$E(y \mid x, y > 0) = \exp(x\beta)$, (17.56)


and this contains $w^* = \exp(x\beta + u)$, with u independent of x, as a special case. We
need not place any additional restrictions on D(y | x, y > 0). Given (17.56), we can
apply NLS to the $y_i > 0$ observations to consistently estimate $\beta$. But NLS is likely
to be inefficient because Var(y |x, y > 0) is unlikely to be constant. We could use 
a WNLS estimator, but a quasi-MLE in the linear exponential family (LEF), as 
we discussed in Section 13.11.3, is a nice, simple alternative. Using the gamma quasi- 
log-likelihood function is especially attractive as it produces a relatively efficient 
estimator when the variance is proportional to the square of the mean, which holds 
in the leading case $w^* = \exp(x\beta + u)$ with u independent of x. We discuss such
estimators in more detail in Chapter 18.

Given probit estimates of $P(y > 0 \mid x) = \Phi(x\gamma)$ and QMLE estimates of
$E(y \mid x, y > 0) = \exp(x\beta)$, we can easily estimate $E(y \mid x) = \Phi(x\gamma)\exp(x\beta)$ without
additional distributional assumptions. Computation of semielasticities and elasticities
follows along the same lines as under the homoskedastic lognormality assumption. If 
our goal is to estimate partial effects on the two means, an approach that specifies 
parametric models for the minimal features of D(y|x) is attractive. See Mullahy 
(1998) for further discussion in the context of health economics. 


17.6.3 Exponential Type II Tobit Model


The two-part models in the previous two subsections assume that s and w* are inde- 
pendent conditional on the observed covariates, x, either in the full distributional 
sense or in the conditional mean sense. Generally, we might expect some common 
unobserved factors to affect both the participation decision (whether s is zero or one) 
and the amount decision (how large w* is). For example, in a model of married 
women’s labor supply, unobserved factors that affect the decision to enter the work- 
force might be correlated with factors that affect the hours decision. Fortunately, we 
can modify the lognormal hurdle model to allow conditional correlation between s 
and w*. 

We call the model in this subsection the exponential type II Tobit (ET2T) model.
Before we derive the log-likelihood function, we need to understand where this model 
fits into the literature. Traditionally, the type II Tobit model has been applied to 
missing data problems—that is, where we truly have a sample selection issue. We 
return to this important application in Chapter 19. But, as we emphasized earlier, we 
do not have a missing data problem in the current setting: we have a corner solution 
response, and we have been exploring ways to model D(y |x) that are more flexible 
than the type I Tobit model. Thus far, in the context of equation (17.37), we have 
assumed conditional independence between s and w*. Now, we want to relax that 
assumption. We use the qualifier “exponential” to emphasize that, in (17.37), we 
should have $w^* = \exp(x\beta + u)$; it will not make sense to have $w^* = x\beta + u$, as is often
the case in the study of type II Tobit models. After we cover the exponential version 
of the model, we will explain why a linear model for w* is inappropriate. 

With the model written in equation (17.37), we now allow $u$ and $v$ to be correlated.
Because $v$ has variance equal to one, $\mathrm{Cov}(u, v) = \rho\sigma$, where $\rho$ is the correlation
between $u$ and $v$ and $\sigma^2 = \mathrm{Var}(u)$. Obtaining the log likelihood in this case is
a bit tricky. For simplicity, let $m^* \equiv \log(w^*)$, so that $D(m^* \mid x)$ is $\mathrm{Normal}(x\beta, \sigma^2)$.
Then $\log(y) = m^*$ when $y > 0$. Of course, we still have $P(y = 0 \mid x) = 1 - \Phi(x\gamma)$.
To obtain the density of $y$ (conditional on $x$) over strictly positive values, we find
$f(y \mid x, y > 0)$ and multiply it by $P(y > 0 \mid x) = \Phi(x\gamma)$. To find $f(y \mid x, y > 0)$,
we use the change-of-variables formula $f(y \mid x, y > 0) = g(\log(y) \mid x, y > 0)/y$,
where $g(\cdot \mid x, y > 0)$ is the density of $m^*$ conditional on $y > 0$ (and $x$). Obtaining
$g(m^* \mid x, y > 0) = g(m^* \mid x, s = 1)$ is complicated by the correlation between
$u$ and $v$. One approach is to use Bayes' rule to write $g(m^* \mid x, s = 1) =
P(s = 1 \mid m^*, x)h(m^* \mid x)/P(s = 1 \mid x)$, where $h(m^* \mid x)$ is the density of $m^*$ given $x$.
Then $P(s = 1 \mid x)g(m^* \mid x, s = 1) = P(s = 1 \mid m^*, x)h(m^* \mid x)$, and this is the expression
we want to obtain the density of $y$ given $x$ for strictly positive $y$. Now, we can
write $s = 1[x\gamma + v > 0] = 1[x\gamma + (\rho/\sigma)u + e > 0]$, where we use $v = (\rho/\sigma)u + e$ and
$e \mid x, u \sim \mathrm{Normal}(0, 1 - \rho^2)$. Because $u = m^* - x\beta$, we have


$$P(s = 1 \mid m^*, x) = \Phi\left([x\gamma + (\rho/\sigma)(m^* - x\beta)](1 - \rho^2)^{-1/2}\right).$$


Further, we have assumed that $h(m^* \mid x)$ is $\mathrm{Normal}(x\beta, \sigma^2)$. Therefore, the density of
$y$ given $x$ over strictly positive $y$ is


$$f(y \mid x) = \Phi\left([x\gamma + (\rho/\sigma)(\log(y) - x\beta)](1 - \rho^2)^{-1/2}\right)\phi\left((\log(y) - x\beta)/\sigma\right)/(\sigma y).$$

Combining this expression with the density at y = 0 gives the log likelihood as 

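$$\begin{aligned}
\ell_i(\theta) = {} & 1[y_i = 0]\log[1 - \Phi(x_i\gamma)] \\
& + 1[y_i > 0]\big\{\log\big[\Phi\big([x_i\gamma + (\rho/\sigma)(\log(y_i) - x_i\beta)](1 - \rho^2)^{-1/2}\big)\big] \\
& \qquad + \log[\phi((\log(y_i) - x_i\beta)/\sigma)] - \log(\sigma) - \log(y_i)\big\}.
\end{aligned} \tag{17.57}$$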


Many econometrics packages have this estimator programmed, although the emphasis
is on sample selection problems, and one must define $\log(y_i)$ as the variable where
the data are missing (when $y_i = 0$). When $\rho = 0$, we obtain the log likelihood for the
lognormal hurdle model from the previous subsection. (Incidentally, for a true missing
data problem, the last term in (17.57), $\log(y_i)$, is not included. That is because in
sample selection problems the log-likelihood function is only a partial log likelihood,
where we truly do not observe a response variable for part of the sample. That is not
the case here. Inclusion of $\log(y_i)$ does not affect the estimation problem, but it does
affect the value of the log-likelihood function, which is needed to compare across
different models.)
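
The log likelihood in (17.57) translates almost term by term into code. What follows is only a minimal sketch, assuming a list-based representation of $x_i$, $\gamma$, and $\beta$; the function names are hypothetical and nothing beyond the Python standard library is used. It also lets us confirm numerically that setting $\rho = 0$ gives back the lognormal hurdle log likelihood.

```python
import math

def norm_cdf(z):
    """Standard normal CDF, Phi(z)."""
    return 0.5 * math.erfc(-z / math.sqrt(2.0))

def norm_logpdf(z):
    """Log of the standard normal density, log phi(z)."""
    return -0.5 * z * z - 0.5 * math.log(2.0 * math.pi)

def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))

def et2t_loglik_i(y, x, gamma, beta, sigma, rho):
    """Contribution of one observation to the ET2T log likelihood (17.57)."""
    xg, xb = dot(x, gamma), dot(x, beta)
    if y == 0:
        return math.log(1.0 - norm_cdf(xg))
    resid = math.log(y) - xb
    index = (xg + (rho / sigma) * resid) / math.sqrt(1.0 - rho ** 2)
    return (math.log(norm_cdf(index)) + norm_logpdf(resid / sigma)
            - math.log(sigma) - math.log(y))

def lognormal_hurdle_loglik_i(y, x, gamma, beta, sigma):
    """Lognormal hurdle contribution: probit part plus lognormal density."""
    xg, xb = dot(x, gamma), dot(x, beta)
    if y == 0:
        return math.log(1.0 - norm_cdf(xg))
    resid = math.log(y) - xb
    return (math.log(norm_cdf(xg)) + norm_logpdf(resid / sigma)
            - math.log(sigma) - math.log(y))
```

Summing `et2t_loglik_i` over observations and passing the negative of the sum to a numerical optimizer would give the MLE; in practice one would reparameterize $\sigma$ and $\rho$ (for example, through logs and a hyperbolic tangent) to keep them in range.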

The ET2T model contains the conditional lognormal model from the previous 
subsection because both models assume that (u,v) is independent of x and jointly 
normally distributed; the conditional lognormal model makes the extra assumption 
that u and v are independent. This fact seems to imply that we should, at a minimum, 
always estimate the more general model to see if it is needed. But the issue is not so 
simple. It turns out that the model with unknown $\rho$ can be poorly identified if the set
of explanatory variables that appears in $w^* = \exp(x\beta + u)$ is the same as the variables
in $s = 1[x\gamma + v > 0]$. One way to see the problem is to derive $E[\log(y) \mid x, y > 0]$.
We do this by first obtaining $E(m^* \mid x, s = 1)$. By iterated expectations, $E(m^* \mid x, s) =
E[E(m^* \mid x, v) \mid x, s]$ because $s$ is a function of $(x, v)$. But


$$E(m^* \mid x, v) = x\beta + E(u \mid x, v) = x\beta + E(u \mid v) = x\beta + \eta v,$$


where $\eta \equiv \rho\sigma$ is the population regression coefficient from $u$ on $v$. Therefore,
$E(m^* \mid x, s) = x\beta + \eta E(v \mid x, s)$, and, as we showed in Section 17.2, $E(v \mid x, s = 1) =
\lambda(x\gamma)$, where $\lambda(\cdot)$ is the inverse Mills ratio. Therefore, $E(m^* \mid x, s = 1) = x\beta + \eta\lambda(x\gamma)$,
and so


$$E[\log(y) \mid x, y > 0] = x\beta + \eta\lambda(x\gamma). \tag{17.58}$$


If we know $\gamma$ (a valid stance for identification analysis because we can estimate it
consistently using probit), equation (17.58) nominally identifies $\beta$ and $\eta$. But
identification is possible only because $\lambda(\cdot)$ is a nonlinear function. If $\beta$ is unrestricted, then
$\lambda(x\gamma)$ is a function of $x$, just not an exact linear function. This kind of identification
can lead to poor estimators in practice because $\lambda(\cdot)$ can be very close to linear over
the appropriate range. In fact, if we do not impose a probit model on $P(y > 0 \mid x)$,
identification would be lost because we would have to allow $\lambda(\cdot)$ to be from a class of
functions that contain functions arbitrarily close to linear functions.
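
To get a feel for how close to linear the inverse Mills ratio can be, we can compute the $R^2$ from a linear fit of $\lambda(z)$ on $z$ over a grid. The sketch below is illustrative only: the range $[-1, 1]$ for the probit index is an assumption chosen for the example, not a range taken from any data set, and the helper names are hypothetical.

```python
import math

def norm_cdf(z):
    """Standard normal CDF."""
    return 0.5 * math.erfc(-z / math.sqrt(2.0))

def norm_pdf(z):
    """Standard normal density."""
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)

def inv_mills(z):
    """Inverse Mills ratio, lambda(z) = phi(z) / Phi(z)."""
    return norm_pdf(z) / norm_cdf(z)

def linear_fit_r2(zs, ys):
    """R-squared from a simple regression of ys on zs and a constant."""
    n = len(zs)
    zbar, ybar = sum(zs) / n, sum(ys) / n
    szz = sum((z - zbar) ** 2 for z in zs)
    syy = sum((y - ybar) ** 2 for y in ys)
    szy = sum((z - zbar) * (y - ybar) for z, y in zip(zs, ys))
    return szy * szy / (szz * syy)

# Assumed range of the index x*gamma; purely for illustration.
grid = [-1.0 + 2.0 * k / 200 for k in range(201)]
r2 = linear_fit_r2(grid, [inv_mills(z) for z in grid])
```

Over this assumed range the $R^2$ is close to one, which is exactly the situation in which adding $\lambda(x\hat{\gamma})$ to a regression that already contains $x$ produces severe collinearity.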

As a practical matter, the simple two-step procedure suggested by (17.58) can lead
to imprecise or unexpected estimates of $\beta$ and $\eta$. The two-step procedure obtains $\hat{\gamma}$
from probit of $s_i$ on $x_i$, and then $\hat{\beta}$ and $\hat{\eta}$ are obtained from OLS of $\log(y_i)$ on $x_i$,
$\lambda(x_i\hat{\gamma})$ using only observations with $y_i > 0$. Heckman (1976) originally proposed this
two-step procedure, although he had the sample selection problem more in mind. 
Generally, the two-step method is referred to as Heckman’s method or Heckit. We will 
study the method applied to missing data problems much more fully in Chapter 19. 

Of course, the two-step estimation method may poorly identify $\beta$ and $\eta$ simply
because it does not efficiently use all of the information in $D(y \mid x)$. But there are
other indications that the general model is poorly identified. It can be shown that 


$$E(y \mid x) = \Phi(x\gamma + \eta)\exp(x\beta + \sigma^2/2), \tag{17.59}$$


which is exactly of the same form as equation (17.53), where $u$ and $v$ are assumed
to be independent. The only difference is the appearance of $\eta$. However, because $x$
always should include a constant, $\eta$ is not separately identified by $E(y \mid x)$ (and neither
is $\sigma^2/2$). If we based identification entirely on $E(y \mid x)$, there would be no difference
between the lognormal hurdle model and the ET2T model when the same set of
regressors appears in the participation and amount equations.
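
The lack of identification from $E(y \mid x)$ alone is easy to see numerically. In the following sketch, which uses made-up parameter values and a single scalar regressor plus a constant (both assumptions for illustration), an ET2T parameter vector with $\rho \neq 0$ and a hurdle-type parameter vector with $\rho = 0$ produce the same conditional mean once the two intercepts absorb $\eta$ and the change in $\sigma^2/2$.

```python
import math

def norm_cdf(z):
    """Standard normal CDF."""
    return 0.5 * math.erfc(-z / math.sqrt(2.0))

def et2t_mean(x, gamma0, gamma1, beta0, beta1, sigma, rho):
    """E(y|x) from (17.59) for a scalar regressor x and a constant."""
    eta = rho * sigma
    return (norm_cdf(gamma0 + gamma1 * x + eta)
            * math.exp(beta0 + beta1 * x + sigma ** 2 / 2.0))

# Hypothetical ET2T parameters with correlated errors.
params_a = dict(gamma0=0.2, gamma1=0.5, beta0=1.0, beta1=0.3, sigma=1.2, rho=-0.6)

# Observationally equivalent (for the mean) parameters with rho = 0:
# the probit intercept absorbs eta, the amount intercept absorbs sigma^2/2.
eta_a = params_a["rho"] * params_a["sigma"]
sigma_b = 0.8
params_b = dict(
    gamma0=params_a["gamma0"] + eta_a,
    gamma1=params_a["gamma1"],
    beta0=params_a["beta0"] + (params_a["sigma"] ** 2 - sigma_b ** 2) / 2.0,
    beta1=params_a["beta1"],
    sigma=sigma_b,
    rho=0.0,
)

means_a = [et2t_mean(x, **params_a) for x in (-2.0, 0.0, 1.5)]
means_b = [et2t_mean(x, **params_b) for x in (-2.0, 0.0, 1.5)]
```

The two lists of means agree up to rounding error, so any information about $\rho$ must come from features of $D(y \mid x)$ other than its mean.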

While the previous discussion indicates that the model with $\rho$ unknown may be
poorly identified, all of the parameters are technically identified by the log likelihood
in (17.57), and MLE is generally feasible. Unfortunately, as with the two-step
procedure, the estimates can be difficult to believe when the same set of regressors shows
up in both parts of the model. 


Example 17.4 (Annual Hours Equation for Married Women): Table 17.1 reports
linear model and Tobit estimates for married women's labor supply. We now
estimate the truncated normal hurdle model, the lognormal hurdle model, and the
ET2T model. The results are given in Table 17.2.


Table 17.2
Models for Married Women's Labor Supply

                              (1)                    (2)                    (3)
                        Truncated Normal          Lognormal             Exponential
Model                        Hurdle                 Hurdle             Type II Tobit

Participation equation
  nwifeinc              -.012    (.005)        -.012    (.005)        -.0097   (.0043)
  educ                   .131    (.025)         .131    (.025)         .120    (.022)
  exper                  .123    (.019)         .123    (.019)         .083    (.017)
  exper²                -.0019   (.0006)       -.0019   (.0006)       -.0013   (.0005)
  age                   -.088    (.015)        -.088    (.015)        -.033    (.008)
  kidslt6               -.868    (.119)        -.868    (.119)        -.504    (.107)
  kidsge6                .036    (.043)         .036    (.043)         .070    (.039)
  constant               .270    (.509)         .270    (.509)        -.367    (.448)

Amount equation              hours               log(hours)             log(hours)
  nwifeinc               .153    (5.164)       -.0020   (.0044)        .0067   (.0050)
  educ                 -29.85    (22.84)       -.039    (.020)        -.119    (.024)
  exper                 72.62    (21.24)        .073    (.018)        -.033    (.020)
  exper²                -.944    (.609)        -.0012   (.0005)        .0006   (.0006)
  age                  -27.44    (8.29)        -.024    (.007)         .014    (.008)
  kidslt6             -484.91    (153.79)      -.585    (.119)         .208    (.134)
  kidsge6             -102.66    (43.54)       -.069    (.037)        -.092    (.043)
  constant            2,123.5    (483.3)        7.90    (.43)          8.67    (.50)

$\hat{\sigma}$                        850.77    (43.80)        .884    (.030)         1.209   (.051)
$\hat{\rho}$                           --                      --                    -.972    (.010)

Log likelihood        -3,791.95              -3,894.93              -3,877.88
Number of women          753                    753                    753

All estimates are from maximum likelihood, with standard errors in parentheses after coefficients.
The headings for the amount equation are intended to emphasize that log(hours) can be expressed as a
linear function in the lognormal hurdle and exponential type II Tobit models.

There are several interesting features of the estimates in Table 17.2. First, because 
the lognormal hurdle model is nested within the ET2T model, the log likelihood of 
the latter is necessarily larger than that of the former. The LR test decidedly rejects
the null model with LR = 34.10 for the test of $H_0: \rho = 0$. But we should view the
very large, negative value $\hat{\rho} = -.972$ with suspicion. It seems very unlikely that the
unobserved factors that positively affect the decision to enter the workforce have a 
strong negative effect on how much to work. 

Because of the correlation allowed between u and v in the ET2T model, it is 
not immediately obvious how each explanatory variable affects E(y|x, y > 0) or 
$E(y \mid x)$. The positive coefficient on kidslt6 for the amount equation seems odd, but,
because $\hat{\beta}_{\mathit{kidslt6}} > 0$, $\hat{\eta} < 0$, $\hat{\gamma}_{\mathit{kidslt6}} < 0$, and $\lambda(\cdot)$ has a negative slope, the estimated
partial effect of kidslt6 on $E[\log(\mathit{hours}) \mid x, \mathit{hours} > 0]$ is ambiguous. Equation (17.59)
shows that the effect on $E(\mathit{hours} \mid x)$ is ambiguous, and we would have to plug in
specific values of x to determine the sign of the partial effect at interesting values. The 
experience profile also seems unusual, although, again, it is difficult in the ET2T 
model to figure out how the inverted U-shape for the participation part and the 
U-shape for the amount part translate into partial effects. The difficulty in interpreting
the estimates in the ET2T model, coupled with the unbelievable estimate of $\rho$,
makes this model undesirable for this application. (Unlike with the truncated normal
and lognormal hurdle models, the label “amount equation” is less suitable as a label 
in the ET2T model: equation (17.58) makes it clear that both sets of parameters enter 
the expectation conditional on y > 0.) 

Based on the log likelihood, the truncated normal hurdle model fits considerably 
better than the lognormal hurdle model. We can apply Vuong’s (1989) test to see 
if the difference is statistically significant. Because the participation equations are 
identical probits in both models, we can only test the “amount” models on the 428 
observations with positive hours. The simplest way to implement the test is to regress
$\hat{\ell}_{i1} - \hat{\ell}_{i2}$ on a constant and perform a $t$ test, where, for observation $i$, $\hat{\ell}_{i1}$ is the log
likelihood for the truncated normal model and $\hat{\ell}_{i2}$ is the log likelihood for the
lognormal model. The average difference in the log likelihoods is .241 with standard
error = .033, and so the difference is highly statistically significant. Therefore, we can 
at least reject the lognormal model as being the true model. 
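
Regressing the difference in log likelihoods on a constant is the same as computing a one-sample $t$ statistic for the mean difference. A minimal sketch follows; the function name is hypothetical and the inputs are assumed to be the per-observation log likelihoods from the two fitted models, evaluated on the same observations.

```python
import math

def vuong_t(loglik1, loglik2):
    """Mean, standard error, and t statistic of loglik1[i] - loglik2[i].

    Equivalent to an OLS regression of the differences on a constant.
    """
    d = [a - b for a, b in zip(loglik1, loglik2)]
    n = len(d)
    mean = sum(d) / n
    var = sum((v - mean) ** 2 for v in d) / (n - 1)
    se = math.sqrt(var / n)
    return mean, se, mean / se

# Using the figures reported in the text (average difference .241,
# standard error .033), the implied t statistic is about 7.3.
t_reported = 0.241 / 0.033
```

A positive and significant statistic favors the first model; the test does not require either model to be correctly specified.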

Conditional on hours > 0, the truncated normal model fits the conditional mean 
better than the lognormal model, too. The squared correlation between hours; and 
the fitted values for the TNH model is about .138 (computed from (17.46)) and about 
.128 for the LH model (computed from (17.52)). 

We can also easily test the type I Tobit model against the TNH model by using the 
LR statistic. In this case, the usual Tobit model imposes eight restrictions of the form 
$\gamma_j = \beta_j/\sigma$. The LR statistic is LR = 2(3,819.09 - 3,791.95) = 54.28, which yields a
p-value of essentially zero, and so the standard Tobit model is strongly rejected. 
However, it is interesting to note that the Tobit model fits better than the ET2T 
model, even though the latter model contains nine more parameters. 
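
The likelihood ratio calculations in this example are simple enough to reproduce directly from the reported log-likelihood values. The sketch below uses only the standard library; the closed-form chi-square tail probabilities it relies on hold for even degrees of freedom and for one degree of freedom, which covers the two tests in this example. The function names are hypothetical.

```python
import math

def lr_stat(loglik_unrestricted, loglik_restricted):
    """Likelihood ratio statistic, 2 * (L_ur - L_r)."""
    return 2.0 * (loglik_unrestricted - loglik_restricted)

def chi2_sf_even_df(x, df):
    """Upper-tail chi-square probability; valid only for even df."""
    if df <= 0 or df % 2 != 0:
        raise ValueError("df must be a positive even integer")
    half = x / 2.0
    term, total = 1.0, 1.0
    for j in range(1, df // 2):
        term *= half / j
        total += term
    return math.exp(-half) * total

def chi2_sf_df1(x):
    """Upper-tail chi-square probability with one degree of freedom."""
    return math.erfc(math.sqrt(x / 2.0))

# Tobit (restricted) against the truncated normal hurdle: 8 restrictions.
lr_tobit = lr_stat(-3791.95, -3819.09)
p_tobit = chi2_sf_even_df(lr_tobit, 8)

# Lognormal hurdle (rho = 0) against the ET2T model: 1 restriction.
lr_rho = lr_stat(-3877.88, -3894.93)
p_rho = chi2_sf_df1(lr_rho)
```

Both p-values are essentially zero, in line with the conclusions in the text.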

As a practical matter, the TNH model allows certain variables to affect the par- 
ticipation and amount decisions differently. For example, education has a positive 
effect on the participation decision but appears to have no effect, or maybe a negative 
effect, on the hours decision conditional on participation. The number of older chil- 
dren does not seem to affect participation—which makes sense because the older 
children are in school for most of the year, making at least part-time work much 
more convenient—but having older children has a negative effect on amount of hours 
worked conditional on working. While having young children has large negative 
effects on both the participation and amount decisions, the Tobit restriction,
$\gamma_{\mathit{kidslt6}} = \beta_{\mathit{kidslt6}}/\sigma$, seems to be rejected, with $\hat{\gamma}_{\mathit{kidslt6}} = -.868$ and $\hat{\beta}_{\mathit{kidslt6}}/\hat{\sigma} = -.570$.


A complete analysis of this example would be to obtain partial effects of some key 
variables, perhaps nwifeinc and kidslt6, on the different conditional expectations and
to see how the partial effects change across models. 

In the example just given, the TNH model provides the best fit and gives sensible 
estimates. In other cases, one can imagine the lognormal distribution could fit better 
for the amount distribution conditional on y > 0; such issues must be studied for 
each application. 

The ET2T model is more convincing when the covariates determining the partici- 
pation decision strictly contain those affecting the amount decision. Then, the model 
can be expressed as 


$$y = 1[x\gamma + v > 0] \cdot \exp(x_1\beta_1 + u), \tag{17.60}$$


where both $x$ and $x_1$ contain unity as their first elements but $x_1$ is a strict subset of $x$.
If we write $x = (x_1, x_2)$, then we are assuming $\gamma_2 \neq 0$. Given at least one exclusion
restriction, we can see from $E[\log(y) \mid x, y > 0] = x_1\beta_1 + \eta\lambda(x\gamma)$ that $\beta_1$ and $\eta$ are
likely to be better identified because $\lambda(x\gamma)$ is not an exact function of $x_1$. (Identification
of $\gamma$ is not an issue because it is always identified by the probit model for
$P(y > 0 \mid x)$.) Unfortunately, where the exclusion restriction might come from is often
unclear in applications. To use an exclusion restriction in Example 17.4, we need an 
observed variable that affects the labor force participation decision but not the 
amount decision. Perhaps a measure of the accessibility of day care can be viewed as 
affecting fixed costs of participating but not the amount decision. (The price of day 
care would typically affect the participation and amount decisions.) But such a vari- 
able is not available in the Mroz (1987) data set. 

In Example 9.5, where we estimated a simultaneous equations model for hours and 
log(wage), restricting ourselves to working women, we assumed that past workforce 
experience had no effect on hours; this allowed us to identify the hours equation. If we
make a similar assumption and allow experience to affect participation but exclude it
from $x_1$, we obtain two exclusion restrictions because exper appears as a quadratic.
Unfortunately, using these exclusion restrictions does not appreciably change the
estimated correlation between $v$ and $u$: $\hat{\rho}$ becomes $-.963$, and the estimated
coefficients on the explanatory variables are similar to those in Table 17.2. Therefore, for
this application, the ET2T model has some serious shortcomings even if we accept 
the exclusion restrictions. 

Given that the TNH model (and even the Tobit model) fits better than the ET2T 
model, it is tempting to apply the type II Tobit model to the level, y, rather than 
$\log(y)$. After all, the TNH model can be expressed as $y = 1[x\gamma + v > 0] \cdot (x\beta + u)$.
But in the TNH model, the truncated normal distribution of $u$ at the value $-x\beta$
ensures that $x\beta + u > 0$. If we apply the type II Tobit model directly to $y$, we must
assume $(u, v)$ is bivariate normal and independent of $x$. What we gain is that $u$ and $v$
can be correlated, but this comes at the cost of not specifying a proper density
because the T2T model allows negative outcomes on $y$. (This was not a problem when
we applied the model to $\log(y)$.) Now, rather than (17.46) or (17.52), both of which
are guaranteed to be positive, we would have


$$E(y \mid x, y > 0) = x\beta + \eta\lambda(x\gamma), \tag{17.61}$$


where $\eta = \rho\sigma$, $\rho = \mathrm{Corr}(u, v)$, and $\sigma^2 = \mathrm{Var}(u)$. When we obtain either two-step
estimates or MLEs of $\gamma$, $\beta$, and $\eta$, nothing guarantees the right-hand side of (17.61) is
positive for all $x$. Especially when $\rho < 0$, it is possible to get negative estimates of
$E(y \mid x, y > 0)$. Clearly negative estimates are possible in the case $\rho = 0$, as nothing
guarantees $x\hat{\beta} > 0$. Therefore, although the T2T model has been applied to corner
solution responses—see, for example, Blank (1988) for hours worked and Franses 
and Paap (2001) for charitable contributions—it is not generally a good idea. As we 
will see in Chapter 19, the type II Tobit model was originally intended for sample 
selection problems. 

If we (inappropriately) apply the T2T model to hours, the value of the “log
likelihood” (that is, the value of the partial log likelihood obtained by treating the
$\mathit{hours}_i = 0$ observations as missing data) is -3,823.77, which is notably lower than
the log likelihood value for the model it is supposed to nest, the TNH model (with log 
likelihood —3,791.95). This provides verification that the T2T “model” does not nest 
Cragg’s TNH model and in fact fits much worse. 


17.7 Two-Limit Tobit Model 


As mentioned in the introduction, some corner solution responses take on two values 
with positive probability. When the response variable is a fraction or a percent, the 
corners are usually at zero and one or zero and 100, respectively. But it is also pos- 
sible that institutional constraints impose corners at other values. For example, if 
workers are allowed to contribute at most 15% of their earnings to a tax-deferred 
pension plan, and y; is the fraction of income contributed for worker i, then the 
corners are at zero and .15. 

Generally, let $a_1 < a_2$ be the two limit values of $y$ in the population. Then the
two-limit Tobit model is most easily defined in terms of an underlying latent variable:

$$y^* = x\beta + u, \quad u \mid x \sim \mathrm{Normal}(0, \sigma^2)$$

$$\begin{aligned}
y &= a_1 && \text{if } y^* \le a_1 \\
y &= y^* && \text{if } a_1 < y^* < a_2 \\
y &= a_2 && \text{if } y^* \ge a_2.
\end{aligned} \tag{17.62}$$


The specification in equation (17.62) ensures that $P(y = a_1) > 0$ and $P(y = a_2) > 0$
but $P(y = a) = 0$ for $a_1 < a < a_2$. Therefore, this model is applicable only when we
actually see pileups at the two endpoints and then a (roughly) continuous distribution 
in between. 

Using arguments similar to those for the type I Tobit model, the density of $y$ is the same as
that of $y^*$ for values in $(a_1, a_2)$. Further, as you are asked to work out in Problem 17.3,


$$P(y = a_1 \mid x) = \Phi((a_1 - x\beta)/\sigma) \tag{17.63}$$

$$P(y = a_2 \mid x) = \Phi(-(a_2 - x\beta)/\sigma). \tag{17.64}$$

It follows that the log-likelihood function for a random draw i is 

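$$\begin{aligned}
\log f(y_i \mid x_i; \theta) = {} & 1[y_i = a_1]\log[\Phi((a_1 - x_i\beta)/\sigma)] + 1[y_i = a_2]\log[\Phi(-(a_2 - x_i\beta)/\sigma)] \\
& + 1[a_1 < y_i < a_2]\log[(1/\sigma)\phi((y_i - x_i\beta)/\sigma)].
\end{aligned}$$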


Many econometrics packages that estimate the standard Tobit model also allow 
specifying any lower and upper limit. The log likelihood is well behaved, and stan- 
dard asymptotic theory for MLE applies. 
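
For readers who want to see the pieces of the two-limit Tobit log likelihood separately, here is a minimal sketch for a single observation. It takes the index $x_i\beta$ as an already computed number, and the default limits of 0 and .15 mirror the pension contribution example above; the function names are hypothetical.

```python
import math

def norm_cdf(z):
    """Standard normal CDF."""
    return 0.5 * math.erfc(-z / math.sqrt(2.0))

def norm_logpdf(z):
    """Log of the standard normal density."""
    return -0.5 * z * z - 0.5 * math.log(2.0 * math.pi)

def two_limit_tobit_loglik_i(y, xb, sigma, a1=0.0, a2=0.15):
    """Log likelihood of one observation under (17.62)."""
    if y <= a1:
        # Pileup at the lower corner, equation (17.63).
        return math.log(norm_cdf((a1 - xb) / sigma))
    if y >= a2:
        # Pileup at the upper corner, equation (17.64).
        return math.log(norm_cdf(-(a2 - xb) / sigma))
    # Interior outcome: normal density of the latent variable.
    return norm_logpdf((y - xb) / sigma) - math.log(sigma)
```

Setting `a2` to a very large number effectively removes the upper limit, so the lower-corner and interior branches then reproduce the standard Tobit contributions when `a1 = 0`.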

As usual with nonlinear models, a difficult aspect is in knowing which estimated 
features to report. It can be shown (again see Problem 17.3) that 


$$E(y \mid x, a_1 < y < a_2) = x\beta + \sigma\,\frac{\phi((a_1 - x\beta)/\sigma) - \phi((a_2 - x\beta)/\sigma)}{\Phi((a_2 - x\beta)/\sigma) - \Phi((a_1 - x\beta)/\sigma)}, \tag{17.65}$$


where the term after $x\beta$ is the extension of the inverse Mills ratio. The so-called
unconditional expectation can be obtained from


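$$\begin{aligned}
E(y \mid x) &= a_1 P(y = a_1 \mid x) + P(a_1 < y < a_2 \mid x)E(y \mid x, a_1 < y < a_2) + a_2 P(y = a_2 \mid x) \\
&= a_1\Phi((a_1 - x\beta)/\sigma) + P(a_1 < y < a_2 \mid x)E(y \mid x, a_1 < y < a_2) \\
&\quad + a_2\Phi(-(a_2 - x\beta)/\sigma).
\end{aligned} \tag{17.66}$$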


Equations (17.65) and (17.66) are cumbersome to work with, but they do allow us to 
obtain predicted values for a vector x, once we have obtained the MLEs. 
As with the single corner at zero, the partial effect of a continuous variable $x_j$ on
$E(y \mid x)$ simplifies to a remarkable degree:


$$\frac{\partial E(y \mid x)}{\partial x_j} = [\Phi((a_2 - x\beta)/\sigma) - \Phi((a_1 - x\beta)/\sigma)]\beta_j. \tag{17.67}$$


This last expression makes partial effects at specific values of x, and APEs, especially 
easy to compute for continuous explanatory variables. For APEs we have 


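$$\left(N^{-1}\sum_{i=1}^{N}\left[\Phi((a_2 - x_i\hat{\beta})/\hat{\sigma}) - \Phi((a_1 - x_i\hat{\beta})/\hat{\sigma})\right]\right)\hat{\beta}_j, \tag{17.68}$$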


where the scale factor is, of course, between zero and one. To determine how linear 
model estimates compare for estimating APEs, we should compare the OLS esti- 
mates for continuous variables directly to (17.68). APEs for binary variables should 
be obtained from equation (17.66), where we difference the two expected values at the 
two settings of the binary variable, and then average the differences; see (17.18) for 
the standard Tobit case. 
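
The expressions in (17.65) through (17.68) are easy to code once the index $x\beta$ is available. The sketch below is illustrative: the function names are hypothetical, the index is passed in as a number, and the finite-difference comparison at the end is included only as a sanity check that the scale factor in (17.67) is the derivative of (17.66) with respect to the index.

```python
import math

def norm_cdf(z):
    """Standard normal CDF."""
    return 0.5 * math.erfc(-z / math.sqrt(2.0))

def norm_pdf(z):
    """Standard normal density."""
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)

def two_limit_mean(xb, sigma, a1, a2):
    """E(y|x) from (17.66), using (17.63)-(17.65)."""
    z1, z2 = (a1 - xb) / sigma, (a2 - xb) / sigma
    p_low, p_high = norm_cdf(z1), norm_cdf(-z2)
    p_mid = norm_cdf(z2) - norm_cdf(z1)
    # P(a1 < y < a2 | x) * E(y | x, a1 < y < a2) simplifies to:
    interior = p_mid * xb + sigma * (norm_pdf(z1) - norm_pdf(z2))
    return a1 * p_low + interior + a2 * p_high

def scale_factor(xb, sigma, a1, a2):
    """Bracketed term in (17.67): multiplies beta_j in the partial effect."""
    return norm_cdf((a2 - xb) / sigma) - norm_cdf((a1 - xb) / sigma)

def ape_continuous(xb_list, beta_j, sigma, a1, a2):
    """Average partial effect (17.68) for a continuous regressor."""
    avg = sum(scale_factor(xb, sigma, a1, a2) for xb in xb_list) / len(xb_list)
    return avg * beta_j

# Sanity check with made-up values: central difference versus (17.67).
xb0, s0, lo, hi, h = 0.05, 0.10, 0.0, 0.15, 1e-5
numeric = (two_limit_mean(xb0 + h, s0, lo, hi)
           - two_limit_mean(xb0 - h, s0, lo, hi)) / (2.0 * h)
analytic = scale_factor(xb0, s0, lo, hi)
```

For a binary regressor one would instead evaluate `two_limit_mean` at the two settings of the variable and average the differences across observations.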


17.8 Panel Data Methods 


We now cover panel data methods for corner solution responses. We use the same 
notation as in previous chapters, namely, $y_{it}$ denotes the response for unit $i$ at time $t$.
The treatment is similar to that given in Section 15.8 for probit models. 


17.8.1 Pooled Methods 


We begin with the case where $y_{it} \ge 0$ and $P(y_{it} = 0) > 0$. Before covering the type I
Tobit model, it is important to remember that, because we are interested in explaining
$y_{it}$, it is acceptable in some cases to use linear regression methods. Just as in the
cross section case, linear regressions are easy to interpret and might provide accept- 
able approximations to average effects. But the linear model might also not provide 
good estimates of partial effects at more extreme values. 

But it is also easy to apply the type I Tobit model to panel data. We now write 


$$y_{it} = \max(0, x_{it}\beta + u_{it}), \quad t = 1, 2, \ldots, T \tag{17.69}$$

$$u_{it} \mid x_{it} \sim \mathrm{Normal}(0, \sigma^2). \tag{17.70}$$


This model has several notable features. First, it does not maintain strict exogeneity
of $x_{it}$: $u_{it}$ is independent of $x_{it}$, but the relationship between $u_{it}$ and $x_{is}$, $t \neq s$, is
unspecified. As a result, $x_{it}$ could contain $y_{i,t-1}$ or variables that are affected by
feedback. A second important point is that the $\{u_{it}: t = 1, \ldots, T\}$ are allowed to be
serially dependent, which means that the $y_{it}$ can be dependent after conditioning on
the explanatory variables. In short, equations (17.69) and (17.70) only specify a
model for $D(y_{it} \mid x_{it})$, and $x_{it}$ can contain any conditioning variables (time dummies,
interactions of time dummies with time-constant or time-varying variables, lagged 
dependent variables, and so on). 

The pooled estimator maximizes the partial log-likelihood function 


$$\sum_{i=1}^{N}\sum_{t=1}^{T}\ell_{it}(\beta, \sigma^2),$$


where $\ell_{it}(\beta, \sigma^2)$ is the log-likelihood function given in equation (17.20).
Computationally, we just apply Tobit to the data set as if it were one long cross section of size
$NT$. However, while the conditional information matrix equality holds for all $t$ under
assumptions (17.69) and (17.70), a robust variance matrix estimator is needed to
account for serial correlation in the score across $t$; see Sections 13.8.2 and 15.8.1.
Robust Wald and score statistics can be computed as in Section 12.6. The LR sta- 
tistic based on the pooled Tobit estimation is not generally valid without further 
assumptions. 
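
The pooled objective function is nothing more than a double sum of cross-sectional Tobit contributions. The following sketch makes that explicit. It assumes that equation (17.20), which is not reproduced in this section, is the usual type I Tobit log likelihood with a corner at zero; the nested-list layout of the panel and the function names are also assumptions made for illustration.

```python
import math

def norm_cdf(z):
    """Standard normal CDF."""
    return 0.5 * math.erfc(-z / math.sqrt(2.0))

def norm_logpdf(z):
    """Log of the standard normal density."""
    return -0.5 * z * z - 0.5 * math.log(2.0 * math.pi)

def tobit_loglik_it(y, xb, sigma):
    """Standard type I Tobit contribution for one (i, t) pair."""
    if y == 0:
        return math.log(norm_cdf(-xb / sigma))  # log[1 - Phi(xb/sigma)]
    return norm_logpdf((y - xb) / sigma) - math.log(sigma)

def pooled_tobit_loglik(panel_y, panel_x, beta, sigma):
    """Partial log likelihood: sum over units i and time periods t.

    panel_y[i][t] is the response; panel_x[i][t] is a list of K regressors.
    """
    total = 0.0
    for y_i, x_i in zip(panel_y, panel_x):
        for y_it, x_it in zip(y_i, x_i):
            xb = sum(x * b for x, b in zip(x_it, beta))
            total += tobit_loglik_it(y_it, xb, sigma)
    return total
```

Maximizing this function gives the pooled Tobit estimates, but the usual inverse-Hessian standard errors from the optimizer would not account for serial dependence; the robust variance estimator would have to cluster the scores by unit.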
In the case that the panel data model is dynamically complete, that is, 


$$D(y_{it} \mid x_{it}, y_{i,t-1}, x_{i,t-1}, \ldots) = D(y_{it} \mid x_{it}), \tag{17.71}$$


inference is considerably easier: all the usual statistics from pooled Tobit are valid, 
including LR statistics. Remember, we are not assuming any kind of independence
across $t$; in fact, $x_{it}$ can contain lagged dependent variables. It just works out that
dynamic completeness leads to the same inference procedures one would use on in- 
dependent cross sections; see the general treatment in Section 13.8. 

A general test for dynamic completeness can be based on the scores $\hat{s}_{it}$, as mentioned
in Section 13.8.3, but it is nice to have a simple test that can be computed from
pooled Tobit estimation. Under assumption (17.71), variables dated at time $t - 1$
and earlier should not affect the distribution of $y_{it}$ once $x_{it}$ is conditioned on. There
are many possibilities, but we focus on just one here. Define $r_{i,t-1} = 1$ if $y_{i,t-1} = 0$
and $r_{i,t-1} = 0$ if $y_{i,t-1} > 0$. Further, define $\hat{u}_{i,t-1} = y_{i,t-1} - x_{i,t-1}\hat{\beta}$ if $y_{i,t-1} > 0$ and
$\hat{u}_{i,t-1} = 0$ if $y_{i,t-1} = 0$. Then estimate the following (artificial) model by pooled Tobit:


$$y_{it} = \max[0, x_{it}\beta + \gamma_1 r_{i,t-1} + \gamma_2(1 - r_{i,t-1})\hat{u}_{i,t-1} + \mathit{error}_{it}]$$


using time periods $t = 2, \ldots, T$, and test the joint hypothesis $H_0: \gamma_1 = 0, \gamma_2 = 0$.
Under the null of dynamic completeness, $\mathit{error}_{it} = u_{it}$, and the estimation of $\hat{u}_{i,t-1}$
does not affect the limiting distribution of the Wald, LR, or LM tests. In computing
either the LR or LM test it is important to drop the first time period in estimating the
restricted model with $\gamma_1 = \gamma_2 = 0$. Since pooled Tobit is used to estimate both the
restricted and unrestricted models, the LR test is fairly easy to obtain. 
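
Constructing the two added regressors is the only nonstandard step in this test. A minimal sketch for a single cross-sectional unit follows; it assumes the first-stage pooled Tobit estimate of $\beta$ is already available, and the function name and list-based data layout are hypothetical.

```python
def lagged_test_regressors(y_i, x_i, beta_hat):
    """Build the regressors for the dynamic completeness test for one unit.

    y_i[t] and x_i[t] hold the data for periods 1..T (0-based lists).
    Returns tuples (period, r_lag, (1 - r_lag) * u_hat_lag) for periods 2..T.
    """
    rows = []
    for t in range(1, len(y_i)):
        y_lag = y_i[t - 1]
        # Indicator that the lagged outcome was at the corner.
        r_lag = 1 if y_lag == 0 else 0
        # Lagged residual, defined only when the lagged outcome was positive.
        xb_lag = sum(x * b for x, b in zip(x_i[t - 1], beta_hat))
        u_hat_lag = y_lag - xb_lag if y_lag > 0 else 0.0
        rows.append((t + 1, r_lag, (1 - r_lag) * u_hat_lag))
    return rows
```

The first period is dropped automatically because no lag exists for it, which matches the requirement that the restricted model also be estimated on periods 2 through T.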

In some applications it may be important to allow interactions between time 
dummies and explanatory variables. We might also want to allow the variance of $u_{it}$
to change over time to allow more flexibly for time heterogeneity. If $\sigma_t^2 = \mathrm{Var}(u_{it})$, a
pooled approach still works, but $\ell_{it}(\beta, \sigma^2)$ becomes $\ell_{it}(\beta, \sigma_t^2)$, and special software
may be needed for estimation. 

The exact way that lagged dependent variables should appear in dynamic Tobit 
models is not clear. We might want to allow different effects of lagged participation 
and amounts. So, defining $r_{i,t-1} = 1[y_{i,t-1} = 0]$, we might specify


$$y_{it} = \max(0, z_{it}\delta + \rho_1 r_{i,t-1} + \rho_2(1 - r_{i,t-1})y_{i,t-1} + u_{it}).$$


Any of the two-part models and the selection model we discussed in Section 17.6 
are easily adapted to panel data where we use pooled estimation. In Cragg’s trun- 
cated normal hurdle model and the lognormal hurdle model, we can allow lagged 
participation and amount decisions to appear separately in the current participation 
and amount equations. And, of course, we can allow lags of other variables, too. If 
the model is assumed to be dynamically complete—as it typically would be if we start 
adding lagged dependent variables—standard inference from the pooled estimation is 
valid. If we are estimating a static model or finite distributed lag model, serial corre- 
lation robust statistics should be used. The two-limit Tobit model from Section 17.7 
extends in a straightforward manner, too. Any of these methods is just a special case 
of the partial MLE results we discussed in Section 13.8: if the distribution is dynam- 
ically complete, then we can use the usual standard errors and test statistics; if it is 
not, all inference should be made robust to serial dependence. 


17.8.2 Unobserved Effects Models under Strict Exogeneity 


As in the case of probit, allowing for unobserved heterogeneity in Tobit models is 
tricky. Of course, the simple strategy of specifying a linear model is available. That is, 
we can write $y_{it} = x_{it}\beta + c_i + u_{it}$ and, under the assumption that $x_{is}$ is uncorrelated
with $u_{it}$ for all $s$ and $t$, estimate $\beta$ by fixed effects. Of course, FE estimation ignores
the restriction $u_{it} \ge -(x_{it}\beta + c_i)$, and, if we think $E(y_{it} \mid x_{it}, c_i) = x_{it}\beta + c_i$, we would
be ignoring $c_i \ge -x_{it}\beta$ for $t = 1, \ldots, T$. Nevertheless, as we discussed in Section 15.8
for binary responses, the linear model has some advantages: we can leave $D(c_i \mid x_i)$
unspecified, and we can allow for general serial dependence in $\{u_{it}\}$. Plus, as usual,
the $\hat{\beta}_j$ are easy to interpret, although they are at best approximations to APEs.
To exploit the corner solution nature of $y_{it}$, we can instead use the unobserved
effects Tobit model. (We leave the “type I” designation implicit here.) We can write
this model as 


$$y_{it} = \max(0, x_{it}\beta + c_i + u_{it}), \quad t = 1, 2, \ldots, T \tag{17.72}$$

$$u_{it} \mid x_i, c_i \sim \mathrm{Normal}(0, \sigma_u^2), \tag{17.73}$$


where $c_i$ is the unobserved effect and $x_i$ contains $x_{it}$ for all $t$. Assumption (17.73) is a
normality assumption, but it also implies that the $x_{it}$ are strictly exogenous conditional
on $c_i$. As we have seen in several contexts, this assumption rules out certain
kinds of explanatory variables. 

Under assumptions (17.72) and (17.73), we can obtain $E(y_t \mid x_t, c, y_t > 0)$ and
$E(y_t \mid x_t, c)$ as in equations (17.10) and (17.14). These expectations, and therefore the
partial effects, depend on the parameters $\beta$ and $\sigma_u^2$. As in the unobserved effects
models for binary responses, the partial effects on $E(y_t \mid x_t, c, y_t > 0)$ and $E(y_t \mid x_t, c)$
also depend on the unobserved heterogeneity, $c$. If we can estimate the distribution of
$c$ then we can plug in interesting values, such as the mean, median, or various quantiles,
in obtaining the partial effects. As we will see, just as in the unobserved effects
probit model, we can estimate APEs under weaker assumptions. 

Rather than cover a standard random effects version, we consider a more general 
correlated random effects Tobit model that allows $c_i$ and $x_i$ to be correlated. To this
end, assume, just as in the probit case, 


$$c_i \mid x_i \sim \mathrm{Normal}(\psi + \bar{x}_i\xi, \sigma_a^2), \tag{17.74}$$


where $\sigma_a^2$ is the variance of $a_i$ in the equation $c_i = \psi + \bar{x}_i\xi + a_i$. We could replace $\bar{x}_i$
with $x_i$ to be more general, but $\bar{x}_i$ has at most dimension $K$. (As usual, $x_{it}$ would not
include a constant, and time dummies would be excluded from $\bar{x}_i$ because they are
already in $x_{it}$.) Under assumptions (17.72)-(17.74), we can write


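$$y_{it} = \max(0, \psi + x_{it}\beta + \bar{x}_i\xi + a_i + u_{it}) \tag{17.75}$$

$$u_{it} \mid x_i, a_i \sim \mathrm{Normal}(0, \sigma_u^2), \quad t = 1, 2, \ldots, T \tag{17.76}$$

$$a_i \mid x_i \sim \mathrm{Normal}(0, \sigma_a^2). \tag{17.77}$$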


In our previous notation, assumptions (17.75) and (17.76) mean that $D(y_{it} \mid x_i, a_i)
= \mathrm{Tobit}(\psi + x_{it}\beta + \bar{x}_i\xi + a_i, \sigma_u^2)$. The formulation in equations (17.75), (17.76), and
(17.77) is very useful, especially if we assume that, conditional on $(x_i, a_i)$
(equivalently, conditional on $(x_i, c_i)$), the $\{u_{it}\}$ are serially independent:


$$(u_{i1}, \ldots, u_{iT}) \text{ are independent given } (x_i, a_i). \tag{17.78}$$
If we set $\xi = 0$, equations (17.75) through (17.78) constitute the traditional random
effects Tobit model. It is hardly more difficult to estimate the CRE version of the
model, because software that estimates random effects Tobit models can be used to
obtain consistent and $\sqrt{N}$-asymptotically normal MLEs of $\psi$, $\beta$, $\xi$, $\sigma_a^2$, and $\sigma_u^2$. The
log-likelihood function for unit $i$ is obtained first by multiplying the densities for the
$\mathrm{Tobit}(\psi + x_{it}\beta + \bar{x}_i\xi + a_i, \sigma_u^2)$ model across $t$ and then integrating the product with
respect to the $\mathrm{Normal}(0, \sigma_a^2)$ density, which is the distribution of $a_i$. Then, of course, we take
the logarithm of the result. The steps are very similar to obtaining the density for the
random effects probit model (with the Chamberlain-Mundlak device of adding $\bar{x}_i$ as
a set of regressors).

Although we show the CRE Tobit model with only time-varying regressors—likely 
including a full set of time dummies—we can always include time-constant regres- 
sors, too, just as in the CRE probit model. We might include time-constant controls 
to proxy for unobserved heterogeneity, or we might actually be interested in the 
effects of a time-constant variable, say $w_i$, under the assumption $D(c_i \mid x_i, w_i) =
D(c_i \mid x_i)$. Of course, including time-constant controls can often improve the fit of the
model, too. Mechanically, we can include such variables in $x_{it}$ and simply drop the
time averages associated with those variables (just as we do with aggregate time 
effects). We do not make this explicit in what follows. 

As in the probit case, given estimates of all parameters, we can estimate the mean
and variance of $c_i$. A consistent estimator of $\mu_c \equiv E(c_i)$ is simply $\hat{\mu}_c = \hat{\psi} + \bar{x}\hat{\xi}$, where
$\bar{x} = N^{-1}\sum_{i=1}^{N}\bar{x}_i$. A consistent estimator of $\sigma_c^2$ is simply $\hat{\sigma}_c^2 = \hat{\xi}'\hat{\Sigma}_{\bar{x}}\hat{\xi} + \hat{\sigma}_a^2$, where $\hat{\Sigma}_{\bar{x}}$ is
the sample variance matrix of $\{\bar{x}_i : i = 1, \ldots, N\}$. If we define $m(z, \sigma^2) \equiv \Phi(z/\sigma)z +
\sigma\phi(z/\sigma)$ as in equation (17.30), then we can compute partial effects at $\hat{\mu}_c$ by taking
derivatives and changes of $m(x_t\hat{\beta} + \hat{\mu}_c, \hat{\sigma}_u^2)$ with respect to elements of $x_t$. In the case
of a derivative, we simply get $\Phi((x_t\hat{\beta} + \hat{\mu}_c)/\hat{\sigma}_u)\hat{\beta}_j$. Furthermore, we might replace $\hat{\mu}_c$
with $\hat{\mu}_c + k\hat{\sigma}_c$ for some value of $k$, say $k = 1$ or $k = 2$. The bootstrap can be used to
obtain valid standard errors where, as always with large $N$ and small $T$, we resample
the cross section units (keeping all $T$ time periods for each $i$).

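Estimating APEs is also relatively simple. APEs (at $\mathbf{x}_t$) are obtained by finding $E[m(\mathbf{x}_t\boldsymbol{\beta} + c_i, \sigma_u^2)]$ and then computing partial derivatives or changes with respect to elements of $\mathbf{x}_t$. Since $c_i = \psi + \bar{\mathbf{x}}_i\boldsymbol{\xi} + a_i$, we have, by iterated expectations,

$$E[m(\mathbf{x}_t\boldsymbol{\beta} + c_i, \sigma_u^2)] = E\{E[m(\psi + \mathbf{x}_t\boldsymbol{\beta} + \bar{\mathbf{x}}_i\boldsymbol{\xi} + a_i, \sigma_u^2) \mid \mathbf{x}_i]\}, \tag{17.79}$$

where the first expectation is with respect to the distribution of $c_i$. Since $a_i$ and $\mathbf{x}_i$ are independent and $a_i \sim$ Normal$(0, \sigma_a^2)$, the conditional expectation in equation (17.79) is obtained by integrating $m(\psi + \mathbf{x}_t\boldsymbol{\beta} + \bar{\mathbf{x}}_i\boldsymbol{\xi} + a_i, \sigma_u^2)$ over $a_i$ with respect to the Normal$(0, \sigma_a^2)$ distribution. Since $m(\psi + \mathbf{x}_t\boldsymbol{\beta} + \bar{\mathbf{x}}_i\boldsymbol{\xi} + a_i, \sigma_u^2)$ is obtained by integrating $\max(0, \psi + \mathbf{x}_t\boldsymbol{\beta} + \bar{\mathbf{x}}_i\boldsymbol{\xi} + a_i + u_{it})$ with respect to $u_{it}$ over the Normal$(0, \sigma_u^2)$ distribution, it follows that

$$E[m(\psi + \mathbf{x}_t\boldsymbol{\beta} + \bar{\mathbf{x}}_i\boldsymbol{\xi} + a_i, \sigma_u^2) \mid \mathbf{x}_i] = m(\psi + \mathbf{x}_t\boldsymbol{\beta} + \bar{\mathbf{x}}_i\boldsymbol{\xi}, \sigma_a^2 + \sigma_u^2). \tag{17.80}$$

Therefore, the expected value of equation (17.80) (with respect to the distribution of $\bar{\mathbf{x}}_i$) is consistently estimated as

$$N^{-1}\sum_{i=1}^{N} m(\hat{\psi} + \mathbf{x}_t\hat{\boldsymbol{\beta}} + \bar{\mathbf{x}}_i\hat{\boldsymbol{\xi}}, \hat{\sigma}_a^2 + \hat{\sigma}_u^2). \tag{17.81}$$

A similar argument works for $E(y_t \mid \mathbf{x}_t, c, y_t > 0)$: sum $(\hat{\psi} + \mathbf{x}_t\hat{\boldsymbol{\beta}} + \bar{\mathbf{x}}_i\hat{\boldsymbol{\xi}}) + \hat{\sigma}_v\lambda[(\hat{\psi} + \mathbf{x}_t\hat{\boldsymbol{\beta}} + \bar{\mathbf{x}}_i\hat{\boldsymbol{\xi}})/\hat{\sigma}_v]$ in expression (17.81), where $\lambda(\cdot)$ is the inverse Mills ratio and $\hat{\sigma}_v^2 = \hat{\sigma}_a^2 + \hat{\sigma}_u^2$.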
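
The step from (17.79) to (17.80), and the averaging in (17.81), can be checked numerically. The sketch below is a minimal illustration, not part of the original derivation: the names `m`, `integrate_over_a`, and `ape_mean` are hypothetical, the trapezoid rule is an arbitrary choice of quadrature, and the numbers one might pass in are placeholders rather than estimates from any data set.

```python
import math
from statistics import NormalDist

_STD = NormalDist()


def m(a, sigma2):
    """Tobit mean function m(a, sigma^2) = Phi(a/sigma)*a + sigma*phi(a/sigma)."""
    s = math.sqrt(sigma2)
    return _STD.cdf(a / s) * a + s * _STD.pdf(a / s)


def integrate_over_a(index, sigma2_a, sigma2_u, n_grid=401, width=8.0):
    """Left side of (17.80): E[m(index + a, sigma_u^2)] with a ~ Normal(0, sigma_a^2)."""
    sa = math.sqrt(sigma2_a)
    step = 2.0 * width * sa / (n_grid - 1)
    total = 0.0
    for g in range(n_grid):
        a = -width * sa + g * step
        w = 0.5 if g in (0, n_grid - 1) else 1.0      # trapezoid end points
        total += w * m(index + a, sigma2_u) * _STD.pdf(a / sa) / sa
    return total * step


def ape_mean(x_t_beta, xbar_xi, psi, sigma2_a, sigma2_u):
    """Expression (17.81): average m(.) over the units' values of xbar_i * xi."""
    s2v = sigma2_a + sigma2_u
    return sum(m(psi + x_t_beta + v, s2v) for v in xbar_xi) / len(xbar_xi)
```

Comparing `integrate_over_a(index, s2a, s2u)` with `m(index, s2a + s2u)` for a few values of `index` confirms that integrating out $a_i$ simply adds $\sigma_a^2$ to the variance argument, which is why only the sum $\sigma_a^2 + \sigma_u^2$ is needed for the APEs.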
We can relax assumption (17.78) and still obtain consistent, $\sqrt{N}$-asymptotically normal estimates of the APEs. In fact, under assumptions (17.75)-(17.77), we can write

$$y_{it} = \max(0, \psi + \mathbf{x}_{it}\boldsymbol{\beta} + \bar{\mathbf{x}}_i\boldsymbol{\xi} + v_{it}), \tag{17.82}$$

$$v_{it} \mid \mathbf{x}_i \sim \text{Normal}(0, \sigma_v^2), \quad t = 1, 2, \ldots, T, \tag{17.83}$$

where $v_{it} = a_i + u_{it}$. Without further assumptions, the $v_{it}$ are arbitrarily serially correlated, and so maximum likelihood analysis using the density of $\mathbf{y}_i$ given $\mathbf{x}_i$ would be computationally demanding. However, we can obtain $\sqrt{N}$-asymptotically normal estimators by a simple pooled Tobit procedure of $y_{it}$ on $1$, $\mathbf{x}_{it}$, $\bar{\mathbf{x}}_i$, $t = 1, \ldots, T$, $i = 1, \ldots, N$. While we can only estimate $\sigma_v^2$ from this procedure, it is all we need (along with $\hat{\psi}$, $\hat{\boldsymbol{\beta}}$, and $\hat{\boldsymbol{\xi}}$) to obtain the APEs based on expression (17.81). The robust variance matrix for partial MLE derived in Section 13.8.2 should be used for standard errors and inference. A minimum distance approach, analogous to the probit case discussed in Section 15.8.2, is also available.
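
The pooled Tobit objective function implied by (17.82) and (17.83) can be written down directly. The following sketch is an assumption-laden illustration: `pooled_tobit_loglik` is a hypothetical name, the inputs are nested lists indexed by unit and time period, and no optimizer or robust variance calculation is included.

```python
import math
from statistics import NormalDist

_STD = NormalDist()


def pooled_tobit_loglik(y, index, sigma_v):
    """Pooled (partial) log likelihood: sum of Tobit log densities over all (i, t).

    y[i][t] is the outcome; index[i][t] = psi + x_it*beta + xbar_i*xi.
    Serial correlation in v_it is ignored in the objective, so inference
    should use a variance matrix that is robust to it (clustered by i).
    """
    total = 0.0
    for y_i, idx_i in zip(y, index):
        for y_it, idx in zip(y_i, idx_i):
            if y_it <= 0.0:
                total += math.log(_STD.cdf(-idx / sigma_v))          # corner at zero
            else:
                total += math.log(_STD.pdf((y_it - idx) / sigma_v) / sigma_v)
    return total
```

Maximizing this function over the index parameters and $\sigma_v$ gives the pooled estimates; the point of the sketch is only that the objective treats every $(i, t)$ pair as if it were a separate cross section observation.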

From equation (17.81), we can easily obtain scale factors for APEs of the continuous explanatory variables. For given $\mathbf{x}_t$, the estimated scale is $N^{-1}\sum_{i=1}^{N} \Phi((\hat{\psi} + \mathbf{x}_t\hat{\boldsymbol{\beta}} + \bar{\mathbf{x}}_i\hat{\boldsymbol{\xi}})/\hat{\sigma}_v)$, where $\hat{\sigma}_v^2 = \hat{\sigma}_a^2 + \hat{\sigma}_u^2$. We can further average across $\mathbf{x}_{it}$ to obtain a scale factor for time period $t$, or even across all time periods to get a single scale factor. We can use (17.81) directly to estimate APEs for discrete changes. For example, if $x_{tK}$ is binary, we evaluate $m(\hat{\psi} + \mathbf{x}_t\hat{\boldsymbol{\beta}} + \bar{\mathbf{x}}_i\hat{\boldsymbol{\xi}}, \hat{\sigma}_v^2)$ at $x_{tK} = 1$ and $x_{tK} = 0$ and form the difference. Rather than set values for $x_{t1}, \ldots, x_{t,K-1}$, we could average across these as well, and even across all time periods. The next example illustrates how these calculations can be done.
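
A minimal sketch of the two calculations just described (the scale factor for continuous variables and the discrete-change APE for a binary variable) is given below. The function names are hypothetical, and here `m` takes the standard deviation $\sigma_v$ rather than the variance as its second argument, purely for convenience.

```python
import math
from statistics import NormalDist

_STD = NormalDist()


def m(a, sigma):
    """Tobit mean function, written in terms of the standard deviation sigma."""
    return _STD.cdf(a / sigma) * a + sigma * _STD.pdf(a / sigma)


def scale_factor(indices, sigma_v):
    """Average of Phi(index_i / sigma_v); multiply by beta_j to get the APE."""
    return sum(_STD.cdf(idx / sigma_v) for idx in indices) / len(indices)


def ape_binary(indices_at_zero, beta_k, sigma_v):
    """Average difference in m(.) when a binary regressor moves from 0 to 1.

    indices_at_zero[i] is the estimated index for unit i with the binary
    variable set to zero; beta_k is that variable's coefficient.
    """
    diffs = [m(idx + beta_k, sigma_v) - m(idx, sigma_v) for idx in indices_at_zero]
    return sum(diffs) / len(diffs)
```

Because $\partial m(a, \sigma^2)/\partial a = \Phi(a/\sigma)$, the scale factor is the average derivative of the mean function with respect to the index, and it always lies strictly between zero and one.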
Table 17.3
Panel Data Models for Annual Women’s Labor Supply, 1980-1992

                      (1)                   (2)                   (3)
Model                 Linear                RE Tobit              CRE Tobit
Estimation Method     Fixed Effects         MLE                   MLE

nwifeinc              -.775 (.343)          -2.251 (.325)         -1.554 (.382)
ch0_2                 -342.38 (26.65)       -459.93 (22.67)       -472.09 (23.03)
ch3_5                 -254.13 (25.88)       -313.50 (18.82)       -329.39 (19.49)
ch6_17                -42.96 (14.89)        -32.33 (9.82)         -46.12 (10.90)
marr                  -634.80 (286.17)      -657.58 (48.93)       -784.18 (155.01)
constant              1,786 (247.30)        1,676.37 (39.28)      1,646.36 (45.26)
$\hat{\sigma}_a$      --                    768.55                756.40
$\hat{\sigma}_u$      --                    624.29                621.70
scale factor          --                    .811                  .826
Log likelihood        --                    -70,782.09            -70,733.20
Number of women       898                   898                   898

Standard errors are in parentheses.

All specifications include a full set of time dummies.

The standard errors for FE estimation of the linear model are robust to arbitrary serial correlation and heteroskedasticity.

For RE Tobit, $\hat{\sigma}_a = \hat{\sigma}_c$.

The likelihood ratio test for exclusion of the five time averages in column (3) is 97.78, which gives a p-value of essentially zero.


Example 17.5 (Panel Data Estimation of Annual Hours Equation for Women): We now apply the RE Tobit and CRE Tobit models to an annual hours equation using the Panel Study of Income Dynamics for 1980 to 1992 (PSID80_92.RAW). We include other sources of income (nwifeinc), three categories of children (ch0_2, ch3_5, and ch6_17), and a binary marital status indicator (marr). We also include a full set of year dummies (not reported). For comparison purposes, a linear model estimated by fixed effects is also reported (Table 17.3).

The three sets of estimates tell the same story in terms of directions of effects. Other income has a negative, statistically significant effect on annual hours, as do number of children (especially young children) and being married. The scale factors for the RE and CRE Tobit models have been computed by averaging across all $i$ and $t$. Therefore, the APE for nwifeinc for RE Tobit is about .811(-2.25) = -1.82, and for the CRE Tobit model it is .826(-1.55) = -1.28. Both are larger in magnitude than the linear model coefficient obtained from FE, -.775. Certainly, including the time averages makes the APE from the Tobit and that from FE closer, but it does not entirely close the gap. The APE for marr from the CRE Tobit is about -695.26 (obtained from averaging the differences in estimated means with marr = 1 and marr = 0). This is almost 10% higher in magnitude than the FE estimate. Of course, we cannot know which estimate is closer to the true APE; each approach has its drawbacks.
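
The arithmetic behind the numbers quoted in this example can be reproduced from the entries in Table 17.3. The short script below is a sketch for checking those figures only; the variable names and the helper `chi2_sf_5df` are introduced here for illustration, and the script uses the unrounded table coefficients where the text rounds them to two decimals.

```python
import math


def chi2_sf_5df(x):
    """Upper-tail probability of a chi-square variate with 5 degrees of freedom."""
    return (math.erfc(math.sqrt(x / 2.0))
            + math.sqrt(2.0 * x / math.pi) * math.exp(-x / 2.0) * (1.0 + x / 3.0))


# APEs for nwifeinc: scale factor times Tobit coefficient (Table 17.3)
ape_re = 0.811 * -2.251
ape_cre = 0.826 * -1.554

# Likelihood ratio statistic for excluding the five time averages
lr_stat = 2.0 * (-70733.20 - (-70782.09))
p_value = chi2_sf_5df(lr_stat)

# CRE Tobit APE for marr relative to the FE coefficient (in magnitude)
marr_ratio = 695.26 / 634.80

print(f"APE nwifeinc, RE Tobit:  {ape_re:.3f}")
print(f"APE nwifeinc, CRE Tobit: {ape_cre:.3f}")
print(f"LR statistic: {lr_stat:.2f}, p-value: {p_value:.2e}")
print(f"marr APE / FE coefficient: {marr_ratio:.3f}")
```

The likelihood ratio statistic computed from the two reported log likelihoods matches the value given in the table notes, and the ratio for marr shows the CRE Tobit APE is between 9% and 10% larger in magnitude than the FE estimate.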




A correlated random effects version of the two-limit Tobit model follows in the obvious way. As with the standard Tobit model, partial effects at the mean of $c_i$ and other values are easily obtained under the full set of assumptions, but where equation (17.72) is replaced with the equations that generate two corners. APEs can be estimated without assumption (17.78) because the mean function for a two-limit Tobit conditional on $(\mathbf{x}_{it}, \bar{\mathbf{x}}_i)$ has the same structure as the mean function conditional on $(\mathbf{x}_{it}, c_i)$, provided assumption (17.74) holds. The argument is essentially the same as in the standard unobserved effects Tobit model; see Problem 17.16 for details.

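It is also possible to consistently estimate $\boldsymbol{\beta}$ under the conditional median assumption

$$\text{Med}(y_{it} \mid \mathbf{x}_{i1}, \ldots, \mathbf{x}_{iT}, c_i) = \text{Med}(y_{it} \mid \mathbf{x}_{it}, c_i) = \max(0, \mathbf{x}_{it}\boldsymbol{\beta} + c_i), \tag{17.84}$$

which imposes strict exogeneity (for the median) conditional on $c_i$. Equivalently, we can write

$$y_{it} = \max(0, \mathbf{x}_{it}\boldsymbol{\beta} + c_i + u_{it}), \quad \text{Med}(u_{it} \mid \mathbf{x}_i, c_i) = 0, \quad t = 1, \ldots, T. \tag{17.85}$$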


Honoré (1992) uses a clever conditioning argument on pairs of time periods, say $t$ and $s$, that effectively eliminates $c_i$ under the exchangeability assumption that $(u_{it}, u_{is})$ and $(u_{is}, u_{it})$ are identically distributed conditional on $(\mathbf{x}_{it}, \mathbf{x}_{is}, c_i)$. The details are too involved to cover here. A nice feature of Honoré’s method is that it identifies the parameters in $\text{Med}(y_{it} \mid \mathbf{x}_i, c_i)$ without restricting $D(c_i \mid \mathbf{x}_i)$. However, restrictions are imposed on the joint distribution of $(u_{i1}, \ldots, u_{iT})$, restrictions that rule out, for example, heteroskedasticity that depends on the covariates, unobserved effect, or just time. (Of course, the standard Tobit model imposes homoskedasticity, too, although that is fairly easy to relax in parametric contexts.) A less obvious problem with Honoré’s method is that, because we know nothing about the distribution of $c_i$ (either unconditionally or conditional on $\mathbf{x}_i$), we have no idea what are sensible values to plug in for $c$ in the estimated median function $\max(0, \mathbf{x}_t\hat{\boldsymbol{\beta}} + c)$. The estimate $\hat{\beta}_j$ is the estimated partial effect of $x_{tj}$ on the median once $\mathbf{x}_t\boldsymbol{\beta} + c > 0$, but we cannot even estimate the fraction of the population where $\mathbf{x}_t\boldsymbol{\beta} + c > 0$. We might be satisfied by averaging $c$ out of $\max(0, \mathbf{x}_t\boldsymbol{\beta} + c)$, that is, obtaining an average structural function but where the median, rather than the mean, is defined as the “structure” of interest. But again, without information about the distribution of $c_i$, we cannot compute this average. (If $c_i$ has a Normal$(\mu_c, \sigma_c^2)$ distribution, then $E_{c_i}(\max(0, \mathbf{x}_t\boldsymbol{\beta} + c_i)) = \Phi((\mu_c + \mathbf{x}_t\boldsymbol{\beta})/\sigma_c)(\mu_c + \mathbf{x}_t\boldsymbol{\beta}) + \sigma_c\phi((\mu_c + \mathbf{x}_t\boldsymbol{\beta})/\sigma_c)$, but we do not have enough information to estimate $\mu_c$ and $\sigma_c^2$.)

Given the variety of methods for estimating unobserved effects models for corner solution responses, we can produce a table very similar to Table 15.4 for binary response. With Honoré’s estimator of $\boldsymbol{\beta}$ playing the role of the fixed effects (FE) logit estimator, the table would only be slightly changed. For example, the CRE Tobit model under the full random effects assumptions allows estimation of partial effects at different values of $c$, and also estimation of APEs. If we drop the conditional independence assumption, we can only identify APEs (that is, averaged across the distribution of $c_i$). As we just discussed, Honoré’s estimator allows estimation of $\boldsymbol{\beta}$, but not generally partial effects on the median or APEs on the median. Honoré’s approach does allow serial correlation in the idiosyncratic errors (unlike the FE logit estimator for binary response), but does impose exchangeability.


17.8.3 Dynamic Unobserved Effects Tobit Models 

We now turn to a specific dynamic model,

$$y_{it} = \max(0, \mathbf{z}_{it}\boldsymbol{\delta} + \rho_1 y_{i,t-1} + c_i + u_{it}), \tag{17.86}$$

$$u_{it} \mid (\mathbf{z}_i, y_{i,t-1}, \ldots, y_{i0}, c_i) \sim \text{Normal}(0, \sigma_u^2), \quad t = 1, \ldots, T. \tag{17.87}$$

We can embellish this model in many ways. For example, the lagged effect of $y_{i,t-1}$ can depend on whether $y_{i,t-1}$ is zero or greater than zero. Thus, we might replace $\rho_1 y_{i,t-1}$ by $\eta_1 r_{i,t-1} + \rho_1(1 - r_{i,t-1})y_{i,t-1}$, where $r_{it}$ is a binary variable equal to unity if $y_{it} = 0$. We can allow a polynomial in $y_{i,t-1}$. Or, we can let the variance of $u_{it}$ change over time. The basic approach does not depend on the particular model.

The discussion in Section 15.8.4 about how to handle the initial value problem also holds here (see Section 13.9.2 for the general case). A fairly general and tractable approach is to specify a distribution for the unobserved effect, $c_i$, given the initial value, $y_{i0}$, and the exogenous variables in all time periods, $\mathbf{z}_i$. Let $h(c \mid y_0, \mathbf{z}; \boldsymbol{\gamma})$ denote such a density. Then the joint density of $(y_1, \ldots, y_T)$ given $(y_0, \mathbf{z})$ is

$$\int_{-\infty}^{\infty} \prod_{t=1}^{T} f(y_t \mid y_{t-1}, \ldots, y_1, y_0, \mathbf{z}, c; \boldsymbol{\theta})\, h(c \mid y_0, \mathbf{z}; \boldsymbol{\gamma})\, dc, \tag{17.88}$$

where $f(y_t \mid y_{t-1}, \ldots, y_1, y_0, \mathbf{z}, c; \boldsymbol{\theta})$ is the censored-at-zero normal distribution with mean $\mathbf{z}_t\boldsymbol{\delta} + \rho_1 y_{t-1} + c$ and variance $\sigma_u^2$. A natural specification for $h(c \mid y_0, \mathbf{z}; \boldsymbol{\gamma})$ is Normal$(\psi + \xi_0 y_0 + \mathbf{z}\boldsymbol{\xi}, \sigma_a^2)$, where $\sigma_a^2 = \text{Var}(c \mid y_0, \mathbf{z})$. This leads to a fairly straightforward procedure. To see why, write $c_i = \psi + \xi_0 y_{i0} + \mathbf{z}_i\boldsymbol{\xi} + a_i$, so that

$$y_{it} = \max(0, \psi + \mathbf{z}_{it}\boldsymbol{\delta} + \rho_1 y_{i,t-1} + \xi_0 y_{i0} + \mathbf{z}_i\boldsymbol{\xi} + a_i + u_{it}), \tag{17.89}$$

where the distribution of $a_i$ given $(y_{i0}, \mathbf{z}_i)$ is Normal$(0, \sigma_a^2)$, and assumption (17.87) holds with $a_i$ replacing $c_i$. The density in expression (17.88) then has the same form as the random effects Tobit model, where the explanatory variables at time $t$ are $(\mathbf{z}_{it}, y_{i,t-1}, y_{i0}, \mathbf{z}_i)$. The inclusion of the initial condition in each time period, as well as the entire vector $\mathbf{z}_i$, allows for the unobserved heterogeneity to be correlated with the initial condition and the strictly exogenous variables. Any software that estimates random effects Tobit models can be used to estimate all the parameters; in particular, we can easily test for state dependence ($\rho_1 \neq 0$). APEs can be estimated rather easily. For example, the APEs for $E(y_t \mid \mathbf{z}_t, y_{t-1}, c)$ are consistently estimated from

$$N^{-1}\sum_{i=1}^{N} m(\hat{\psi}_t + \mathbf{z}_t\hat{\boldsymbol{\delta}} + \hat{\rho}_1 y_{t-1} + \hat{\xi}_0 y_{i0} + \mathbf{z}_i\hat{\boldsymbol{\xi}}, \hat{\sigma}_a^2 + \hat{\sigma}_u^2), \tag{17.90}$$

where different time intercepts, $\hat{\psi}_t$, are explicitly shown for emphasis. (As with linear models and previous nonlinear models, one would usually include time dummies among the regressors.) As usual, we can take derivatives or changes with respect to elements of $(\mathbf{z}_t, y_{t-1})$. Allowing more flexible functions of the initial condition in $E(c_i \mid y_{i0}, \mathbf{z}_i)$ is straightforward. Allowing heteroskedasticity in $\text{Var}(c_i \mid y_{i0}, \mathbf{z}_i)$, say $\text{Var}(c_i \mid y_{i0}, \mathbf{z}_i) = \exp(\omega + \zeta_0 y_{i0} + \mathbf{z}_i\boldsymbol{\zeta})$, or with more flexible functions in $(y_{i0}, \mathbf{z}_i)$, should not be too difficult. See Wooldridge (2005b) for further discussion and extensions, including to two-limit Tobit models.
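
In practice, the main mechanical step in this approach is constructing the regressors $(\mathbf{z}_{it}, y_{i,t-1}, y_{i0}, \mathbf{z}_i)$ for each unit before passing them to a random effects Tobit routine. The helper below is a hypothetical sketch of that step for a single unit; the name `dynamic_rows` and the list-based data layout are assumptions made here for illustration, and time dummies are omitted for brevity.

```python
def dynamic_rows(y, z):
    """Build the regressor rows for one unit in the dynamic Tobit model.

    y: outcomes [y_0, y_1, ..., y_T], including the initial condition y_0.
    z: list of T lists, z[t-1] holding the exogenous variables for period t.
    Returns T rows, one per period t = 1..T, each of the form
    z_t + [y_{t-1}, y_0] + (z_1, ..., z_T stacked), matching (17.89).
    """
    T = len(z)
    if len(y) != T + 1:
        raise ValueError("y must contain the initial condition plus T periods")
    z_all = [value for row in z for value in row]    # entire vector z_i
    rows = []
    for t in range(1, T + 1):
        rows.append(list(z[t - 1]) + [y[t - 1], y[0]] + z_all)
    return rows


# Usage with made-up numbers: T = 2, one exogenous variable per period
example = dynamic_rows([0.0, 5.0, 0.0], [[1.0], [2.0]])
```

Each row repeats $y_{i0}$ and the full history $\mathbf{z}_i$, which is what allows the heterogeneity to be correlated with the initial condition and the exogenous variables. A more parsimonious variant could replace the stacked $\mathbf{z}_i$ with its time average, but that is a modeling choice rather than something the construction requires.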

Using a censoring argument, Honoré (1993a) obtains orthogonality conditions that have zero mean at the true values of the parameters without making distributional assumptions about $c_i$ or $u_{it}$ in equation (17.86). Honoré’s assumptions put restrictions on the distribution of $\{u_{it} : t = 1, \ldots, T\}$ (sufficient is that they are independent, identically distributed conditional on $(\mathbf{z}_i, y_{i0}, c_i)$), but he does not impose a parametric distribution. Therefore, one can test for state dependence fairly generally (although heteroskedasticity in $\{u_{it}\}$ is ruled out). Honoré and Hu (2004) obtained sufficient conditions such that the parameters $\boldsymbol{\delta}$ and $\rho_1$ are identified by a set of moment conditions similar to those in Honoré (1993a), and they derive the consistency and $\sqrt{N}$-asymptotic normality of a GMM estimator. Unfortunately, the methods used by Honoré (1993a) and Honoré and Hu (2004) do not allow for time dummies, and so it could be difficult to distinguish state dependence from aggregate fluctuations. Further, the censoring arguments hinge critically on $y_{i,t-1}$ appearing linearly in (17.86). It is not clear how to extend their arguments to general functions of $y_{i,t-1}$. At a minimum, one would need to know the signs of the coefficients on the extra functions (such as quadratics, or where we allow a separate effect from zero to a positive value).

As things stand, semiparametric methods for estimating dynamic corner solution models do not uniformly relax assumptions of parametric approaches. Parametric approaches easily handle time dummies, nonlinear functions of $y_{i,t-1}$, and parametric heteroskedasticity. Therefore, it is difficult to know the proper reaction when discrepancies are found in the parameter estimates. Further, as with semiparametric methods for corner solution responses with strictly exogenous explanatory variables, the ability of semiparametric methods to consistently estimate parameters does not result in estimates of partial effects, and so the practical importance of any state dependence is not easily determined.


Problems 


17.1. When $y$ is a nonnegative corner solution response with corner at zero, one strategy that has been suggested is to use $\log(1 + y)$ as the dependent variable in a linear regression.

a. Does the transformation $\log(1 + y)$ solve the problem of a pileup at zero? Explain. Are there other reasons that this transformation might be useful?

b. Suppose we assume the linear model

$$\log(1 + y) = \mathbf{x}\boldsymbol{\beta} + r, \quad E(r \mid \mathbf{x}) = 0.$$

How would you estimate $\boldsymbol{\beta}$? Generally, can $r$ be independent of $\mathbf{x}$? Explain.

c. Show that

$$E(y \mid \mathbf{x}) = \exp(\mathbf{x}\boldsymbol{\beta})E[\exp(r) \mid \mathbf{x}] - 1.$$

If, as an approximation, we assume $r$ and $\mathbf{x}$ are independent with $\eta = E[\exp(r)]$, find $E(y \mid \mathbf{x})$.

d. Under the independence assumption in part c, propose a consistent estimate of $\eta$. (Hint: Use Duan’s (1983) smearing estimate.)

e. Given $\hat{\boldsymbol{\beta}}$ and $\hat{\eta}$, how would you estimate $E(y \mid \mathbf{x})$? Is the estimate guaranteed to be nonnegative? Explain.

f. Use the data in MROZ.RAW to estimate $\boldsymbol{\beta}$ and $\eta$, where $y$ = hours. The elements of $\mathbf{x}$ should be the same as in Table 17.1. What is $\hat{\eta}$ for this data set? Compute the fitted values for $hours_i$, say $\widehat{hours}_i$. Do you get any negative fitted values?

g. Using the fitted values from part f, obtain the squared correlation between $hours_i$ and $\widehat{hours}_i$. How does this R-squared measure compare to that for the linear model and the Tobit model in Table 17.1?

h. Test the errors $r_i$ for heteroskedasticity by running the regression $\hat{r}_i^2$ on $1$, $(\mathbf{x}_i\hat{\boldsymbol{\beta}})$, $(\mathbf{x}_i\hat{\boldsymbol{\beta}})^2$, where $\mathbf{x}_i\hat{\boldsymbol{\beta}}$ are the fitted values from the OLS regression $\log(1 + hours_i)$ on $\mathbf{x}_i$. (See Section 6.2.4.) Does it appear that $r_i$ is independent of $\mathbf{x}_i$?

17.2. Let $y$ be a response variable taking on values in $[0, 1]$, where $P(y = 0) = 0$ and $P(y = 1) > 0$. An example is, for a random sample of markets indexed by $i$, $y_i$ is the share of the largest firm; by definition, $y_i$ cannot be zero but could be one.

a. Would a two-limit Tobit model be appropriate for $y$? Explain.

b. Explain why a type I Tobit model applied to $-\log(y)$ makes logical sense.


c. If you apply the suggestion in part b, would it be easy to recover E(y| x)? Show 
what you would have to do. 


17.3. Suppose that $y$ given $\mathbf{x}$ follows a two-limit Tobit model as in Section 17.7, with limit points $a_1 < a_2$.

a. Find $P(y = a_1 \mid \mathbf{x})$ and $P(y = a_2 \mid \mathbf{x})$ in terms of the standard normal cdf, $\mathbf{x}$, $\boldsymbol{\beta}$, and $\sigma$. For $a_1 < y < a_2$, find $P(y \le y \mid \mathbf{x})$, and use this to find the density of $y$ given $\mathbf{x}$ for $a_1 < y < a_2$.

b. If $z \sim$ Normal$(0, 1)$, it can be shown that $E(z \mid c_1 < z < c_2) = \{\phi(c_1) - \phi(c_2)\}/\{\Phi(c_2) - \Phi(c_1)\}$ for $c_1 < c_2$. Use this fact to find $E(y \mid \mathbf{x}, a_1 < y < a_2)$ and $E(y \mid \mathbf{x})$.

c. Consider the following method for estimating $\boldsymbol{\beta}$. Using only the nonlimit observations, that is, observations for which $a_1 < y_i < a_2$, run the OLS regression of $y_i$ on $\mathbf{x}_i$. Explain why this does not generally produce a consistent estimator of $\boldsymbol{\beta}$.

d. Write down the log-likelihood function for observation $i$; it should consist of three parts.

e. How would you estimate $E(y \mid \mathbf{x}, a_1 < y < a_2)$ and $E(y \mid \mathbf{x})$?

f. Show that

$$\frac{\partial E(y \mid \mathbf{x})}{\partial x_j} = \{\Phi[(a_2 - \mathbf{x}\boldsymbol{\beta})/\sigma] - \Phi[(a_1 - \mathbf{x}\boldsymbol{\beta})/\sigma]\}\beta_j.$$

Why is the scale factor multiplying $\beta_j$ necessarily between zero and one?

g. Suppose you obtain $\hat{\boldsymbol{\gamma}}$ from a standard OLS regression of $y_i$ on $\mathbf{x}_i$, using all observations. Would you compare $\hat{\gamma}_j$ to the two-limit Tobit estimate, $\hat{\beta}_j$? What would be a sensible comparison?


17.4. Use the data in JTRAIN1.RAW for this question.

a. Using only the data for 1988, estimate a linear equation relating hrsemp to log(employ), union, and grant. Compute the usual and heteroskedasticity-robust standard errors. Interpret the results.

b. Out of the 127 firms with nonmissing data on all variables, how many have hrsemp = 0? Estimate the model from part a by Tobit. Find the estimated average partial effect of grant on E(hrsemp | employ, union, grant, hrsemp > 0) for the 127 firms and union = 1. What is the APE on E(hrsemp | employ, union, grant)?

c. Are log(employ) and union jointly significant in the Tobit model?

d. In terms of goodness of fit for the conditional mean, do you prefer the linear model or Tobit model for estimating E(hrsemp | employ, union, grant)?


17.5. Use the data set FRINGE.RAW for this question. 
a. Estimate a linear model by OLS relating hrbens to exper, age, educ, tenure, married, male, white, nrtheast, nrthcen, south, and union.

b. Estimate a Tobit model relating the same variables from part a. Why do you suppose the OLS and Tobit estimates are so similar?

c. Add $exper^2$ and $tenure^2$ to the Tobit model from part b. Should these be included?


d. Are there significant differences in hourly benefits across industry, holding the 
other factors fixed? 


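17.6. Consider a Tobit model with an endogenous binary explanatory variable:

$$y_1 = \max(0, \mathbf{z}_1\boldsymbol{\delta}_1 + \alpha_1 y_2 + u_1)$$
$$y_2 = 1[\mathbf{z}\boldsymbol{\delta}_2 + v_2 > 0],$$

where $(u_1, v_2)$ is independent of $\mathbf{z}$ with a bivariate normal distribution with mean zero and $\text{Var}(v_2) = 1$. If $u_1$ and $v_2$ are correlated, $y_2$ is endogenous.

a. Find the density of $y_1$ given $(\mathbf{z}, y_2)$. (Hint: First find the density of $y_1$ given $(\mathbf{z}, v_2)$, which has the standard Tobit form.)

b. For any observation $i$, write down the log-likelihood function in terms of the parameters $\boldsymbol{\delta}_1$, $\alpha_1$, $\sigma_1^2$, $\boldsymbol{\delta}_2$, and $\rho_1$, where $\sigma_1^2 = \text{Var}(u_1)$ and $\rho_1 = \text{Cov}(v_2, u_1)$.

c. Discuss the properties of the following two-step method for estimating $(\boldsymbol{\delta}_1, \alpha_1)$: (1) Run probit of $y_{i2}$ on $\mathbf{z}_i$ and obtain the fitted probabilities, $\Phi(\mathbf{z}_i\hat{\boldsymbol{\delta}}_2)$, $i = 1, \ldots, N$. (2) Run Tobit of $y_{i1}$ on $\mathbf{z}_{i1}$, $\Phi(\mathbf{z}_i\hat{\boldsymbol{\delta}}_2)$, and use the coefficients as estimates of $(\boldsymbol{\delta}_1, \alpha_1)$.

d. For the binary response model $y_2 = 1[\mathbf{z}\boldsymbol{\delta}_2 + v_2 > 0]$, the generalized residual for a random draw $i$ is defined as $gr_{i2} = E(v_{i2} \mid y_{i2}, \mathbf{z}_i)$. When $v_2$ is independent of $\mathbf{z}$ with a standard normal distribution, it can be shown that $gr_{i2} = y_{i2}\phi(\mathbf{z}_i\boldsymbol{\delta}_2)/\Phi(\mathbf{z}_i\boldsymbol{\delta}_2) - (1 - y_{i2})\phi(\mathbf{z}_i\boldsymbol{\delta}_2)/[1 - \Phi(\mathbf{z}_i\boldsymbol{\delta}_2)]$. Show that a variable addition test version of the score test for $H_0: \rho_1 = 0$ can be obtained from Tobit of $y_{i1}$ on $\mathbf{z}_{i1}$, $y_{i2}$, $\widehat{gr}_{i2}$ and using a $t$ test on $\widehat{gr}_{i2}$. (Hint: The problem is simplified by reparameterizing the log likelihood by defining $\tau_1^2 = \sigma_1^2 - \rho_1^2$, so that $\rho_1$ appears only multiplying $v_2$.)

17.7. Suppose in a two-part model that $P(y > 0 \mid \mathbf{x})$ follows a probit model, $E(y \mid \mathbf{x}, y > 0) = \exp(\mathbf{x}\boldsymbol{\beta})$, and $\text{Var}(y \mid \mathbf{x}, y > 0) = \eta^2[\exp(\mathbf{x}\boldsymbol{\beta})]^2$. Find $\text{Var}(y \mid \mathbf{x})$. (Hint: Write $y$ as in (17.37), where $s$ and $w^*$ are independent conditional on $\mathbf{x}$, with $E(w^* \mid \mathbf{x}) = \exp(\mathbf{x}\boldsymbol{\beta})$ and $\text{Var}(w^* \mid \mathbf{x}) = \eta^2[\exp(\mathbf{x}\boldsymbol{\beta})]^2$.)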

17.8. Consider three different approaches for modeling $E(y \mid \mathbf{x})$ when $y \ge 0$ is a corner solution outcome: (1) $E(y \mid \mathbf{x}) = \mathbf{x}\boldsymbol{\beta}$; (2) $E(y \mid \mathbf{x}) = \exp(\mathbf{x}\boldsymbol{\beta})$; and (3) $y$ given $\mathbf{x}$ follows a type I Tobit model.

a. How would you estimate models 1 and 2?

b. Obtain three goodness-of-fit statistics that can be compared across models; each should measure how much sample variation in $y_i$ is explained by $E(y_i \mid \mathbf{x}_i)$.

c. Suppose, in your sample, $y_i > 0$ for all $i$. Show that the OLS and Tobit estimates of $\boldsymbol{\beta}$ are identical. Does the fact that they are identical mean that the linear model for $E(y \mid \mathbf{x})$ and the Tobit model produce the same estimates of $E(y \mid \mathbf{x})$? Explain.

d. If $y > 0$ in the population, does a Tobit model make sense? What is a simple alternative to the three approaches listed at the beginning of this problem? What assumptions are sufficient for estimating $E(y \mid \mathbf{x})$?


17.9. Consider the Tobit model $y_1 = \max(0, \mathbf{x}_1\boldsymbol{\beta}_1 + u_1)$ where $\mathbf{x}_1$ is a function of $(\mathbf{z}_1, y_2)$ and $y_2 = \mathbf{z}\boldsymbol{\delta}_2 + v_2$. Assume that $E(\mathbf{z}'v_2) = \mathbf{0}$ and that $u_1 \mid v_2, \mathbf{z} \sim$ Normal$(\theta_1 v_2 + \gamma_1(v_2^2 - \tau_2^2), \exp(\psi_1 + \omega_1 v_2))$, where $\tau_2^2 = E(v_2^2)$.

a. Find $D(y_1 \mid y_2, \mathbf{z})$.

b. Based on your answer in part a, propose a two-step method for estimating all of the parameters.

c. Show how to estimate the average structural function after the two-step estimation. (Hint: This involves averaging across $\hat{v}_{i2}$.)

d. If $v_2 \mid \mathbf{z} \sim$ Normal$(0, \tau_2^2)$, what is an alternative method of estimation?

e. Under the assumption on $D(u_1 \mid v_2)$, can $u_1$ have an unconditional normal distribution? Explain.


17.10. a. Provide a careful derivation of equation (17.16). It will help to use the fact that $d\phi(z)/dz = -z\phi(z)$.

b. Derive equation (17.59).


17.11. Let $y$ be a corner solution response, and let $L(y \mid 1, \mathbf{x}) = \gamma_0 + \mathbf{x}\boldsymbol{\gamma}$ be the linear projection of $y$ onto an intercept and $\mathbf{x}$, where $\mathbf{x}$ is $1 \times K$. If we use a random sample on $(\mathbf{x}, y)$ to estimate $\gamma_0$ and $\boldsymbol{\gamma}$ by OLS, are the estimators inconsistent because of the corner solution nature of $y$? Explain.

17.12. Use the data in APPLE.RAW for this question. These are phone survey data, where each respondent was asked the amount of “ecolabeled” (or “ecologically friendly”) apples he or she would purchase at given prices for both ecolabeled apples and regular apples. The prices are cents per pound, and ecolbs and reglbs are both in pounds.

a. For what fraction of the sample is $ecolbs_i = 0$? Discuss generally whether ecolbs is a good candidate for a Tobit model.

b. Estimate a linear regression model for ecolbs, with explanatory variables log(ecoprc), log(regprc), log(faminc), educ, hhsize, and num5_17. Are the signs of the coefficients for log(ecoprc) and log(regprc) the expected ones? Interpret the estimated coefficient on log(ecoprc).

c. Test the linear regression in part b for heteroskedasticity by running the regression $\hat{u}^2$ on $1$, $\widehat{ecolbs}$, $\widehat{ecolbs}^2$ and carrying out an $F$ test. What do you conclude?


d. Obtain the OLS fitted values. How many are negative? 


e. Now estimate a Tobit model for ecolbs. Are the signs and statistical significance of 
the explanatory variables the same as for the linear regression model? What do you 
make of the fact that the Tobit estimate on log(ecoprc) is about twice the size of the 
OLS estimate in the linear model? 


f. Obtain the estimated partial effect of log(ecoprc) for the Tobit model using equa- 
tion (17.16), where the x; are evaluated at the mean values. What is the estimated 
price elasticity (again, at the mean values of the x;)? 


g. Reestimate the Tobit model dropping the variable log(regprc). What happens to 
the coefficient on log(ecoprc)? What kind of correlation does this result suggest be- 
tween log(ecoprc) and log(regprc)? 
h. Reestimate the model from part e, but with ecoprc and regprc as the explanatory 
variables, rather than their natural logs. Which functional form do you prefer? (Hint: 
Compare log-likelihood functions.) 


17.13. Suppose that, in the context of an unobserved effects Tobit (or probit) panel data model, the mean of the unobserved effect, $c_i$, is related to the time average of detrended $\mathbf{x}_{it}$. Specifically,

$$c_i = \left[T^{-1}\sum_{t=1}^{T} (\mathbf{x}_{it} - \boldsymbol{\pi}_t)\right]\boldsymbol{\xi} + a_i,$$

where $\boldsymbol{\pi}_t = E(\mathbf{x}_{it})$, $t = 1, \ldots, T$, and $a_i \mid \mathbf{x}_i \sim$ Normal$(0, \sigma_a^2)$. How does this extension of equation (17.74) affect estimation of the unobserved effects Tobit (or probit) model?




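17.14. Consider the correlated random effects Tobit model under assumptions (17.72), (17.73), and (17.78), but replace assumption (17.74) with

$$c_i \mid \mathbf{x}_i \sim \text{Normal}[\psi + \bar{\mathbf{x}}_i\boldsymbol{\xi}, \sigma_a^2\exp(\bar{\mathbf{x}}_i\boldsymbol{\lambda})].$$

See Problem 15.18 for the probit case.

a. What is the density of $y_{it}$ given $(\mathbf{x}_i, a_i)$, where $a_i = c_i - E(c_i \mid \mathbf{x}_i)$?

b. Derive the log-likelihood function by first finding the density of $(y_{i1}, \ldots, y_{iT})$ given $\mathbf{x}_i$.

c. Assuming you have estimated $\boldsymbol{\beta}$, $\sigma_u^2$, $\psi$, $\boldsymbol{\xi}$, $\sigma_a^2$, and $\boldsymbol{\lambda}$ by CMLE, how would you estimate the APEs?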

17.15. Use the data in CPS91.RAW to answer this question.

a. Estimate a Tobit model for hours with nwifeinc, educ, exper, $exper^2$, age, kidlt6, and kidge6 as explanatory variables. What is the value of the log-likelihood function?


b. Estimate Cragg’s lognormal hurdle model. Does it fit the conditional distribution 
of hours better than the Tobit model? 


c. Estimate the ET2T model (with all explanatory variables in both stages). Discuss 
how it compares to the model from part b. Do you reject the model from part b in 
favor of the ET2T model? 


d. Estimate Cragg’s truncated normal hurdle model. How does its fit compare with 
the previous models? 


17.16. Consider the CRE Tobit model in Section 17.8.2, written with latent variable

$$y_{it}^* = \mathbf{x}_{it}\boldsymbol{\beta} + c_i + u_{it}$$
$$y_{it} = q_1 \quad \text{if } y_{it}^* \le q_1$$
$$y_{it} = q_2 \quad \text{if } y_{it}^* \ge q_2$$
$$y_{it} = y_{it}^* \quad \text{if } q_1 < y_{it}^* < q_2,$$

where $q_1 < q_2$ are the two known limits. Make assumptions (17.73), (17.74), and (17.78).

a. Obtain the log-likelihood function for estimating all of the parameters.

b. How would you estimate $E(c_i)$ and $\text{Var}(c_i)$?

c. Describe how to obtain the APEs on $E(y_{it} \mid \mathbf{x}_{it}, c_i)$.

d. How would your analysis change if you dropped assumption (17.78)?

17.17. Consider an unobserved effects Tobit model with a continuous endogenous explanatory variable:

$$y_{it1} = \max(0, \alpha_1 y_{it2} + \mathbf{z}_{it1}\boldsymbol{\delta}_1 + c_{i1} + u_{it1})$$
$$y_{it2} = \mathbf{z}_{it}\boldsymbol{\delta}_2 + c_{i2} + u_{it2}$$
$$c_{i1} = \psi_1 + \bar{\mathbf{z}}_i\boldsymbol{\xi}_1 + a_{i1}$$
$$c_{i2} = \psi_2 + \bar{\mathbf{z}}_i\boldsymbol{\xi}_2 + a_{i2}$$

Although the assumptions can be weakened somewhat, assume that $(u_{it1}, u_{it2}) \mid \mathbf{z}_i, \mathbf{a}_i$ is bivariate normal with mean zero and constant conditional variance matrix (across $t$, as well as not depending on the conditioning variables). Further, assume $(a_{i1}, a_{i2}) \mid \mathbf{z}_i$ is bivariate normal with mean zero and constant conditional variance matrix. Note that $y_{it2}$ will not be strictly exogenous in the estimable equation, only contemporaneously. The mechanics are very similar to those given in Section 15.8.5 for the probit model.

a. Obtain a control function method for consistently estimating $\alpha_1$ and $\boldsymbol{\delta}_1$. (Hint: It will help to define $v_{it1} = a_{i1} + u_{it1}$ and $v_{it2} = a_{i2} + u_{it2}$.)

b. How would you obtain standard errors for the parameter estimators?

c. How would you consistently estimate the average structural function (which is a function of $(y_{t2}, \mathbf{z}_{t1})$)?


17.18. Consider a standard linear model where the endogenous explanatory variable, $y_2$, is a corner solution:

$$y_1 = \mathbf{z}_1\boldsymbol{\delta}_1 + \alpha_1 y_2 + u_1$$
$$E(u_1 \mid \mathbf{z}) = 0,$$

where $\mathbf{z}_1$ is a strict subset of $\mathbf{z}$.

a. What additional assumptions ensure that the 2SLS estimator, using instruments $\mathbf{z}$ and a random sample, is consistent for the parameters? Does it matter that $y_2$ is a corner solution?

b. Suppose you think $D(y_2 \mid \mathbf{z})$ follows a Tobit model and $\text{Var}(u_1 \mid \mathbf{z}) = \sigma_1^2$. What might you do instead of 2SLS in part a, and why? State a minimal set of assumptions that ensure the estimator is consistent.

c. Suppose $y_2 = \max(0, \mathbf{z}\boldsymbol{\delta}_2 + v_2)$, $v_2 \mid \mathbf{z} \sim$ Normal$(0, \tau_2^2)$ and $E(u_1 \mid \mathbf{z}, v_2) = \rho_1 v_2$. Propose a control function method for estimating $(\boldsymbol{\delta}_1, \alpha_1)$. (For example, see Vella (1993).) Does the CF approach change if $(\mathbf{z}_1, y_2)$ is replaced by some known function, $\mathbf{x}_1$, of $(\mathbf{z}_1, y_2)$ (for example, including quadratics and interactions)?

d. Compare the merits of the estimation methods from parts a, b, and c.

e. If in the setup of part c you assume $(u_1, v_2)$ is independent of $\mathbf{z}$ with a joint normal distribution, what estimation method would you use? Provide the objective function.


17.19. Let $y_1, y_2, \ldots, y_G$ be a set of limited dependent variables representing a population. These could be outcomes for the same individual, family, firm, and so on. Some elements could be binary outcomes, some could be ordered outcomes, and others might be corner solutions. For a vector of conditioning variables $\mathbf{x}$ and unobserved heterogeneity $c$, assume that $y_1, y_2, \ldots, y_G$ are independent conditional on $(\mathbf{x}, c)$, where $f_g(\cdot \mid \mathbf{x}, c; \boldsymbol{\gamma}_o^g)$ denotes the density of $y_g$ given $(\mathbf{x}, c)$. For example, if $c$ is a scalar and $y_1$ is a binary response, $f_1(\cdot \mid \mathbf{x}, c; \boldsymbol{\gamma}_o^1)$ might represent a probit model with response probability $\Phi(\mathbf{x}\boldsymbol{\gamma}_o^1 + c)$.

a. Write down the density of $\mathbf{y} = (y_1, y_2, \ldots, y_G)$ given $(\mathbf{x}, c)$.

b. Let $h(\cdot \mid \mathbf{x}; \boldsymbol{\theta}_o)$ be the density of $c$ given $\mathbf{x}$, where $\boldsymbol{\theta}_o$ is the vector of unknown parameters. Find the density of $\mathbf{y}$ given $\mathbf{x}$. Are the $y_g$ independent conditional on $\mathbf{x}$? Explain.

c. Find the log likelihood for any random draw $(\mathbf{x}_i, \mathbf{y}_i)$.


17.20. Use the data in PSID80_92.RAW to answer this question.

a. Estimate the dynamic Tobit model in Section 17.8.3 using $y_{it} = hours_{it}$, with elements of $\mathbf{z}_{it}$ including $nwifeinc_{it}$, $ch0\_2_{it}$, $ch3\_5_{it}$, $ch6\_17_{it}$, and $marr_{it}$. Be sure to include a full set of year dummies, too.

b. Estimate the APE of nwifeinc in 1992 when $hours_{t-1} = 0$. Average across the remaining explanatory variables.

c. Estimate the APE of increasing ch0_2 from zero to one in 1992, averaging across all other explanatory variables.


18 Count, Fractional, and Other Nonnegative Responses


18.1 Introduction 


A count variable is a variable that takes on nonnegative integer values. Many vari- 
ables that we would like to explain in terms of covariates come as counts. A few 
examples include the number of times someone is arrested during a given year, 
number of emergency room drug episodes during a given week, number of cigarettes 
smoked per day, and number of patents applied for by a firm during a year. These 
examples have two important characteristics in common: there is no natural a priori 
upper bound, and the outcome will be zero for at least some members of the popu- 
lation. Other count variables do have an upper bound. For example, for the number 
of children in a family who are high school graduates, the upper bound is number of 
children in the family. 

If y is the count variable and x is a vector of explanatory variables, we are often 
interested in the population regression, E(y |x). Throughout this book we have dis- 
cussed various models for conditional expectations, and we have discussed different 
methods of estimation. The most straightforward approach is a linear model, 
$E(y \mid x) = x\beta$, estimated by ordinary least squares (OLS). For count data, linear
models have shortcomings very similar to those for binary responses or corner solution
responses: because $y \ge 0$, we know that $E(y \mid x)$ should be nonnegative for all $x$.
If $\hat{\beta}$ is the OLS estimator, there usually will be values of $x$ such that $x\hat{\beta} < 0$, so that
the predicted value of $y$ is negative. Still, we have seen that the linear model, appropriately
viewed as a linear projection, can sometimes provide good estimates of average
partial effects (APEs) on the conditional mean.

For strictly positive variables, we often use the natural log transformation, log( y), 
and use a linear model. This approach is not possible in interesting count data 
applications, where $y$ takes on the value zero for a nontrivial fraction of the population.
Transformations could be applied that are defined for all $y \ge 0$ (for example,
$\log(1 + y)$), but $\log(1 + y)$ itself is nonnegative, and it is not obvious how to recover
$E(y \mid x)$ from a linear model for $E[\log(1 + y) \mid x]$. With count data, it is better to
model $E(y \mid x)$ directly and to choose functional forms that ensure positivity for any
value of $x$ and any parameter values. When $y$ has no upper bound, the most popular
of these is the exponential function, $E(y \mid x) = \exp(x\beta)$.

In Chapter 12 we discussed nonlinear least squares (NLS) as a general method for 
estimating nonlinear models of conditional means. NLS can certainly be applied to 
count data models, but it is not ideal: NLS is relatively inefficient unless Var(y | x) is 
constant (see Chapter 12), and all of the standard distributions for count data imply 
heteroskedasticity. 

In Section 18.2 we discuss the most popular model for count data, the Poisson regression
model. As we will see, the Poisson regression model has some nice features.
First, if y given x has a Poisson distribution—which used to be the maintained 
assumption in count data contexts—then the conditional maximum likelihood esti- 
mators (CMLEs) are fully efficient. Second, the Poisson assumption turns out to be 
unnecessary for consistent estimation of the conditional mean parameters. As we will 
see in Section 18.2, the Poisson quasi-maximum likelihood estimator is fully robust 
to distributional misspecification. It also maintains certain efficiency properties even 
when the distribution is not Poisson. 

In Section 18.3 we discuss other count data models, including the binomial re- 
gression model when each count response has a known upper bound. Section 18.4 
covers the gamma regression model, which is better suited to nonnegative, continu- 
ous responses (but, given the robustness of the quasi-MLE, can be applied to any 
nonnegative response). In Section 18.5 we discuss how to handle endogenous ex- 
planatory variables with an exponential response function; in particular, the methods 
apply to count data and other nonnegative, unbounded responses. 

Fractional response variables are treated in Section 18.6, where again we focus on 
quasi-likelihood methods. In Section 18.7 we cover panel data extensions of several 
of the quasi-likelihood methods. 


18.2 Poisson Regression 


In Chapter 13 we used the basic Poisson regression model to illustrate maximum 
likelihood estimation. Here we study Poisson regression in much more detail, empha- 
sizing the robustness properties of the estimator when the Poisson distributional 
assumption is incorrect. 


18.2.1 Assumptions Used for Poisson Regression and Quantities of Interest 


The basic Poisson regression model assumes that $y$ given $x \equiv (x_1, \ldots, x_K)$ has a
Poisson distribution, as in El Sayyad (1973) and Maddala (1983, Section 2.15). The
density of $y$ given $x$ under the Poisson assumption is completely determined by the
conditional mean $\mu(x) \equiv E(y \mid x)$:

$$f(y \mid x) = \exp[-\mu(x)]\,[\mu(x)]^{y}/y!, \qquad y = 0, 1, \ldots, \tag{18.1}$$

where $y!$ is $y$ factorial. Given a parametric model for $\mu(x)$ [such as $\mu(x) = \exp(x\beta)$]
and a random sample $\{(x_i, y_i): i = 1, 2, \ldots, N\}$ on $(x, y)$, it is fairly straightforward
to obtain the CMLEs of the parameters. The statistical properties then follow from
our treatment of MLE in Chapter 13.




It has long been recognized that the Poisson distributional assumption imposes 
restrictions on the conditional moments of y that are often violated in applications. 
The most important of these is equality of the conditional variance and mean: 


$$\mathrm{Var}(y \mid x) = E(y \mid x). \tag{18.2}$$


The variance-mean equality has been rejected in numerous applications, and later we 
show that assumption (18.2) is violated for fairly simple departures from the Poisson 
model. Importantly, whether or not assumption (18.2) holds has implications for how 
we carry out statistical inference. In fact, as we will see, it is assumption (18.2), not 
the Poisson assumption per se, that is important for large-sample inference; this point 
will become clear in Section 18.2.2. In what follows we refer to assumption (18.2) as 
the Poisson variance assumption. 
A weaker assumption allows the variance-mean ratio to be any positive constant: 


$$\mathrm{Var}(y \mid x) = \sigma^2 E(y \mid x), \tag{18.3}$$


where $\sigma^2 > 0$ is the variance-mean ratio. This assumption is used in the generalized
linear models (GLM) literature that we discussed in Section 13.11.3, and so we will 
refer to assumption (18.3) as the Poisson GLM variance assumption. The GLM liter- 
ature is concerned with quasi-MLE of a class of nonlinear models that contains 
Poisson regression as a special case. Here, we work through the details for Poisson 
regression. 

The case $\sigma^2 > 1$ is empirically relevant because it implies that the variance is
greater than the mean; this situation is called overdispersion (relative to the Poisson
case). One distribution for $y$ given $x$ where assumption (18.3) holds with overdispersion
is what Cameron and Trivedi (1986) call NegBin I, a particular parameterization
of the negative binomial distribution. When $\sigma^2 < 1$ we say there is
underdispersion. Underdispersion is less common than overdispersion, but it
has been found in some applications.

There are plenty of count distributions for which assumption (18.3) does not 
hold—for example, the NegBin II model in Cameron and Trivedi (1986). Therefore, 
we are often interested in estimating the conditional mean parameters without speci- 
fying the conditional variance. As we will see, Poisson regression turns out to be well 
suited for this purpose. 

Given a parametric model $m(x, \beta)$ for $\mu(x)$, where $\beta$ is a $P \times 1$ vector of parameters,
the log likelihood for observation $i$ is

$$\ell_i(\beta) = y_i \log[m(x_i, \beta)] - m(x_i, \beta), \tag{18.4}$$

where we drop the term $\log(y_i!)$ because it does not depend on the parameters $\beta$ (for
computational reasons dropping this term is a good idea in practice, too, as $y_i!$ gets
very large for even moderate $y_i$). We let $\mathcal{B} \subset \mathbb{R}^P$ denote the parameter space, which
is needed for the theoretical development but is practically unimportant in most
cases.
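
As a minimal sketch of the quasi-log likelihood in (18.4), the following Python function sums $y_i \log[m(x_i, \beta)] - m(x_i, \beta)$ over the sample. The names `poisson_quasi_loglik` and `exp_mean`, and the three-observation data set, are made up for illustration and are not part of the text; the exponential mean is used only as one convenient choice of $m$.

```python
import numpy as np

def poisson_quasi_loglik(beta, X, y, mean_fn):
    """Sum of y_i*log(m_i) - m_i over i, as in (18.4); the log(y_i!) term is dropped."""
    m = mean_fn(X, beta)
    return float(np.sum(y * np.log(m) - m))

def exp_mean(X, beta):
    """Exponential mean function m(x, beta) = exp(x beta)."""
    return np.exp(X @ beta)

# Made-up data for illustration: a constant and one regressor.
X = np.array([[1.0, 0.0], [1.0, 1.0], [1.0, 2.0]])
y = np.array([1.0, 0.0, 3.0])

# At beta = 0 every mean equals one, so each observation contributes -1.
loglik_at_zero = poisson_quasi_loglik(np.zeros(2), X, y, exp_mean)
```

Passing the mean function as an argument keeps the objective usable for mean functions other than the exponential.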

The most common mean function in applications is the exponential: 


$$m(x, \beta) = \exp(x\beta), \tag{18.5}$$


where $x$ is $1 \times K$ and contains unity as its first element, and $\beta$ is $K \times 1$. Under
assumption (18.5) the log likelihood is $\ell_i(\beta) = y_i x_i \beta - \exp(x_i \beta)$. The parameters in
model (18.5) are easy to interpret. If $x_j$ is continuous, then

$$\frac{\partial E(y \mid x)}{\partial x_j} = \exp(x\beta)\beta_j,$$

which shows that the partial effects on $E(y \mid x)$ depend on $x$. Further,

$$\beta_j = \frac{\partial E(y \mid x)}{\partial x_j} \cdot \frac{1}{E(y \mid x)} = \frac{\partial \log[E(y \mid x)]}{\partial x_j},$$

and so $100\beta_j$ is the semielasticity of $E(y \mid x)$ with respect to $x_j$: for small changes $\Delta x_j$,
the percentage change in $E(y \mid x)$ is roughly $(100\beta_j)\Delta x_j$. If we replace $x_j$ with $\log(x_j)$,
$\beta_j$ is the elasticity of $E(y \mid x)$ with respect to $x_j$. Using equation (18.5) as the model for
$E(y \mid x)$ is analogous to using $\log(y)$ as the dependent variable in linear regression
analysis.
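
To see how good the semielasticity approximation is, the short Python sketch below compares the approximate percentage change $100\beta_j \Delta x_j$ with the exact change $100[\exp(\beta_j \Delta x_j) - 1]$ implied by the exponential mean. The coefficient value 0.05 is hypothetical and chosen only for illustration.

```python
import math

beta_j = 0.05   # hypothetical coefficient, not taken from the text
dx = 1.0        # change in x_j

approx_pct = 100 * beta_j * dx                   # semielasticity approximation
exact_pct = 100 * (math.exp(beta_j * dx) - 1)    # exact change under exp(x beta)
gap = exact_pct - approx_pct                     # small when beta_j * dx is small
```

The gap grows with $|\beta_j \Delta x_j|$, so the approximation is best reserved for small changes.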

Quadratic terms can be added with no additional effort, except in interpreting the 
parameters. In what follows, we will write the exponential function as in assumption 
(18.5), leaving transformations of x—such as logs, quadratics, interaction terms, and 
so on—implicit. 

Naturally, x can also include dummy variables or other discrete variables. The 
change in the expected value when, say, $x_K$ goes from $a_K$ to $a_K + 1$ is

$$\exp(\beta_1 + \beta_2 x_2 + \cdots + \beta_{K-1} x_{K-1} + \beta_K (a_K + 1)) - \exp(\beta_1 + \beta_2 x_2 + \cdots + \beta_{K-1} x_{K-1} + \beta_K a_K),$$

while the proportionate change (starting at $x_K = a_K$) is simply $\exp(\beta_K) - 1$. Therefore,
the percentage change in the expected value does not depend on the initial value of
$x_K$ or the other covariates, and is simply $100 \cdot [\exp(\beta_K) - 1]$.

Computing average partial effects (APEs) of an explanatory variable on the mean 
is straightforward with Poisson regression and an exponential mean function. As we 
saw in Chapter 13 (see also equation (18.4)), the first-order condition can be written
as $\sum_{i=1}^{N} x_i'[y_i - \exp(x_i\hat{\beta})] = 0$, and therefore, when $x_{i1} \equiv 1$ (which should always be
the case in practice), the residuals $y_i - \exp(x_i\hat{\beta})$ sum to zero, so that the sample average of the
fitted values $\hat{y}_i = \exp(x_i\hat{\beta})$ equals $\bar{y}$. Because the estimated partial effect of a continuous
variable is $\exp(x\hat{\beta})\hat{\beta}_j$, the average across the sample is simply $\bar{y}\hat{\beta}_j$. Therefore, as a
rough comparison with linear model estimates, the Poisson coefficients can be multiplied
by the average outcome, $\bar{y}$. For discrete changes, the differences in predicted
values (for the two chosen values of $x_K$) must be averaged across all $i$.
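
The APE shortcut can be checked numerically. The sketch below fits the exponential-mean Poisson QMLE by Newton's method on simulated data and verifies that the fitted values average to $\bar{y}$, so that the sample average of $\exp(x_i\hat{\beta})\hat{\beta}_j$ equals $\bar{y}\hat{\beta}_j$. The function name `fit_poisson_qmle`, the simulated design, and the parameter values are assumptions made for this illustration only.

```python
import numpy as np

def fit_poisson_qmle(X, y, tol=1e-10, max_iter=100):
    """Newton iterations for the Poisson QMLE with mean exp(X @ beta)."""
    beta = np.zeros(X.shape[1])
    beta[0] = np.log(y.mean())          # assumes the first column of X is ones
    for _ in range(max_iter):
        m = np.exp(X @ beta)
        step = np.linalg.solve(X.T @ (X * m[:, None]), X.T @ (y - m))
        beta += step
        if np.max(np.abs(step)) < tol:
            break
    return beta

# Simulated data (hypothetical design): intercept plus one continuous regressor.
rng = np.random.default_rng(0)
N = 2000
x1 = rng.normal(size=N)
X = np.column_stack([np.ones(N), x1])
y = rng.poisson(np.exp(0.2 + 0.5 * x1)).astype(float)

beta_hat = fit_poisson_qmle(X, y)
fitted = np.exp(X @ beta_hat)

ape_direct = np.mean(fitted * beta_hat[1])   # average of exp(x_i b) * b_j
ape_shortcut = y.mean() * beta_hat[1]        # ybar * b_j
```

The two APE calculations agree because the first-order condition for the intercept forces the residuals to sum to zero.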

Wooldridge (1997c) discusses functional forms other than the exponential that can 
be used in Poisson regression, including a model that nests the exponential, but an 
exponential regression function with flexible functions of the explanatory variables is 
often adequate. 


18.2.2 Consistency of the Poisson QMLE 


Once we have specified a conditional mean function, we are interested in cases where, 
other than the conditional mean, the Poisson distribution can be arbitrarily misspecified
(subject to regularity conditions). When $y_i$ given $x_i$ does not have a Poisson
distribution, we call the estimator $\hat{\beta}$ that solves

$$\max_{\beta \in \mathcal{B}} \sum_{i=1}^{N} \ell_i(\beta) \tag{18.6}$$

the Poisson quasi-maximum likelihood estimator (QMLE). A careful discussion of the
consistency of the Poisson QMLE requires introduction of the true value of the
parameter, as in Chapters 12 and 13. That is, we assume that for some value $\beta_o$ in
the parameter space $\mathcal{B}$,

$$E(y \mid x) = m(x, \beta_o). \tag{18.7}$$

To prove consistency of the Poisson QMLE under assumption (18.7), the key is to
show that $\beta_o$ is the unique solution to

$$\max_{\beta \in \mathcal{B}} E[\ell_i(\beta)]. \tag{18.8}$$

Then, under the regularity conditions listed in Theorem 12.2, it follows from this
theorem that the solution to equation (18.6) is weakly consistent for $\beta_o$.

Wooldridge (1997c) provides a simple proof that $\beta_o$ is a solution to equation (18.8)
when assumption (18.7) holds (see also Problem 18.1). As we discussed in Section 
13.11.3, this finding follows from general results on QMLE in the linear exponential 
family by Gourieroux, Monfort, and Trognon (1984a) (hereafter GMT, 1984a). 
Uniqueness of $\beta_o$ must be assumed separately, as it depends on the distribution of $x_i$.
That is, in addition to assumption (18.7), identification of $\beta_o$ requires some restrictions
on the distribution of explanatory variables, and these depend on the nature
of the regression function $m$. In the linear regression case, we require full rank of
$E(x_i'x_i)$. For Poisson QMLE with an exponential regression function $\exp(x\beta)$, it can
be shown that multiple solutions to equation (18.8) exist whenever there is perfect
multicollinearity in $x_i$, just as in the linear regression case. If we rule out perfect
multicollinearity, we can usually conclude that $\beta_o$ is identified under assumption
(18.7).

It is important to remember that consistency of the Poisson QMLE does not require
any additional assumptions concerning the distribution of $y_i$ given $x_i$. In particular,
$\mathrm{Var}(y_i \mid x_i)$ can be virtually anything (subject to regularity conditions needed
to apply the results of Chapter 12), and $y_i$ need not even be a count variable.
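
A small simulation can illustrate this robustness. In the sketch below the response is continuous and nonnegative (an exponential mean multiplied by unit-mean gamma noise), so it is clearly not Poisson, yet the Poisson QMLE should land close to the mean parameters in a large sample. The data-generating process, the sample size, and the function name `fit_poisson_qmle` are assumptions made for illustration.

```python
import numpy as np

def fit_poisson_qmle(X, y, tol=1e-10, max_iter=100):
    """Newton iterations for the Poisson QMLE with mean exp(X @ beta)."""
    beta = np.zeros(X.shape[1])
    beta[0] = np.log(y.mean())          # assumes the first column of X is ones
    for _ in range(max_iter):
        m = np.exp(X @ beta)
        step = np.linalg.solve(X.T @ (X * m[:, None]), X.T @ (y - m))
        beta += step
        if np.max(np.abs(step)) < tol:
            break
    return beta

rng = np.random.default_rng(1)
N = 20000
x1 = rng.normal(size=N)
X = np.column_stack([np.ones(N), x1])
beta_o = np.array([0.5, 0.3])           # hypothetical true mean parameters

# y is continuous, not a count: E(y|x) = exp(x beta_o), Var(y|x) = 0.5*exp(x beta_o)^2.
y = np.exp(X @ beta_o) * rng.gamma(shape=2.0, scale=0.5, size=N)

beta_hat = fit_poisson_qmle(X, y)
max_abs_error = np.max(np.abs(beta_hat - beta_o))
```

Only the conditional mean is correctly specified here; the variance is proportional to the squared mean, which violates both (18.2) and (18.3).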


18.2.3 Asymptotic Normality of the Poisson QMLE 


If the Poisson QMLE is consistent for $\beta_o$ without any assumptions beyond (18.7),
why did we introduce assumptions (18.2) and (18.3)? It turns out that whether these
assumptions hold determines which asymptotic variance matrix estimators and inference
procedures are valid, as we now show.

The asymptotic normality of the Poisson QMLE follows from Theorem 12.3. The 
result is 


$$\sqrt{N}(\hat{\beta} - \beta_o) \xrightarrow{d} \mathrm{Normal}(0, A_o^{-1} B_o A_o^{-1}), \tag{18.9}$$

where

$$A_o \equiv E[-H_i(\beta_o)] \tag{18.10}$$

and

$$B_o \equiv E[s_i(\beta_o) s_i(\beta_o)'] = \mathrm{Var}[s_i(\beta_o)], \tag{18.11}$$

where we define $A_o$ in terms of minus the Hessian because the Poisson QMLE solves
a maximization rather than a minimization problem. Taking the gradient of equation
(18.4) and transposing gives the score for observation $i$ as

$$s_i(\beta) = \nabla_\beta m(x_i, \beta)'[y_i - m(x_i, \beta)]/m(x_i, \beta). \tag{18.12}$$

It is easily seen that, under assumption (18.7), $s_i(\beta_o)$ has a zero mean conditional on
$x_i$. The Hessian is more complicated but, under assumption (18.7), it can be shown
that

$$-E[H_i(\beta_o) \mid x_i] = \nabla_\beta m(x_i, \beta_o)' \nabla_\beta m(x_i, \beta_o)/m(x_i, \beta_o). \tag{18.13}$$

Then $A_o$ is the expected value of this expression (over the distribution of $x_i$). A fully
robust asymptotic variance matrix estimator for $\hat{\beta}$ follows from equation (12.49):

$$\left(\sum_{i=1}^{N} \hat{A}_i\right)^{-1} \left(\sum_{i=1}^{N} \hat{s}_i \hat{s}_i'\right) \left(\sum_{i=1}^{N} \hat{A}_i\right)^{-1}, \tag{18.14}$$

where $\hat{s}_i$ is obtained from equation (18.12) with $\hat{\beta}$ in place of $\beta$, and $\hat{A}_i$ is the right-hand
side of equation (18.13) with $\hat{\beta}$ in place of $\beta_o$. This is the fully robust variance
matrix estimator in the sense that it requires only assumption (18.7) and the regularity
conditions from Chapter 12.

The asymptotic variance of $\hat{\beta}$ simplifies under the GLM assumption (18.3). Maintaining
assumption (18.3) (where $\sigma_o^2$ now denotes the true value of $\sigma^2$) and defining
$u_i \equiv y_i - m(x_i, \beta_o)$, the law of iterated expectations implies that

$$\begin{aligned} B_o &= E[u_i^2 \nabla_\beta m_i(\beta_o)' \nabla_\beta m_i(\beta_o)/\{m_i(\beta_o)\}^2] \\ &= E[E(u_i^2 \mid x_i) \nabla_\beta m_i(\beta_o)' \nabla_\beta m_i(\beta_o)/\{m_i(\beta_o)\}^2] = \sigma_o^2 A_o, \end{aligned}$$

since $E(u_i^2 \mid x_i) = \sigma_o^2 m_i(\beta_o)$ under assumptions (18.3) and (18.7). Therefore, $A_o^{-1} B_o A_o^{-1}
= \sigma_o^2 A_o^{-1}$, so we only need to estimate $\sigma_o^2$ in addition to obtaining $\hat{A}$. A consistent
estimator of $\sigma_o^2$ is obtained from $\sigma_o^2 = E[u_i^2/m_i(\beta_o)]$, which follows from assumption
(18.3) and iterated expectations. The usual analogy principle argument gives the
estimator

$$\hat{\sigma}^2 = N^{-1} \sum_{i=1}^{N} \hat{u}_i^2/\hat{m}_i = N^{-1} \sum_{i=1}^{N} \left(\hat{u}_i/\sqrt{\hat{m}_i}\right)^2. \tag{18.15}$$

The last representation shows that $\hat{\sigma}^2$ is simply the average sum of squared weighted
residuals, where the weights are the inverse of the estimated nominal standard deviations.
(As we discussed in Section 13.11.3, the weighted residuals $\tilde{u}_i \equiv \hat{u}_i/\sqrt{\hat{m}_i}$ are
sometimes called the Pearson residuals. In earlier chapters we also called them
standardized residuals.) In the GLM literature, a degrees-of-freedom adjustment is
usually made by replacing $N^{-1}$ with $(N - P)^{-1}$ in equation (18.15); see also equation
(13.91).

Given $\hat{\sigma}^2$ and $\hat{A}$, it is straightforward to obtain an estimate of $\mathrm{Avar}(\hat{\beta})$ under
assumption (18.3). In fact, we can write

$$\widehat{\mathrm{Avar}}(\hat{\beta}) = \hat{\sigma}^2 \hat{A}^{-1}/N = \hat{\sigma}^2 \left(\sum_{i=1}^{N} \nabla_\beta \hat{m}_i' \nabla_\beta \hat{m}_i/\hat{m}_i\right)^{-1}. \tag{18.16}$$

Note that the matrix is always positive definite when the inverse exists, so it produces
well-defined standard errors (given, as usual, by the square roots of the diagonal elements).
We call these the GLM standard errors.

If the Poisson variance assumption (18.2) holds, things are even easier because $\sigma^2$
is known to be unity; the estimated asymptotic variance of $\hat{\beta}$ is given in equation
(18.16) but with $\hat{\sigma}^2 \equiv 1$. The same estimator can be derived from the MLE theory in
Chapter 13 as the inverse of the estimated information matrix (conditional on the $x_i$);
see Section 13.5.2.

Under assumption (18.3) in the case of overdispersion ($\sigma^2 > 1$), standard errors of
the $\hat{\beta}_j$ obtained from equation (18.16) with $\hat{\sigma}^2 \equiv 1$ will systematically underestimate
the asymptotic standard deviations, sometimes by a large factor. For example, if
$\sigma^2 = 2$, the correct GLM standard errors are, in the limit, 41 percent larger than the
incorrect, nominal Poisson standard errors (because $\sqrt{2} \approx 1.41$). It is common to see very significant
coefficients reported for Poisson regressions (for example, Model (1993)), but we
must interpret the standard errors with caution when they are obtained under
assumption (18.2). The GLM standard errors are easily obtained by multiplying the
Poisson standard errors by $\hat{\sigma} \equiv \sqrt{\hat{\sigma}^2}$. The most robust standard errors are obtained
from expression (18.14), as these are valid under any conditional variance assumption.
In practice, it is a good idea to report the fully robust standard errors along with
the GLM standard errors and $\hat{\sigma}$.
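
The three sets of standard errors can be computed side by side. The following sketch, for the exponential mean (where $\nabla_\beta m_i = m_i x_i$, so $\hat{A}_i = \hat{m}_i x_i' x_i$ and $\hat{s}_i = x_i' \hat{u}_i$), computes the nominal Poisson standard errors, the GLM standard errors based on (18.15) and (18.16), and the fully robust standard errors from (18.14). The simulated overdispersed data and all names in the code are assumptions made for illustration.

```python
import numpy as np

def fit_poisson_qmle(X, y, tol=1e-10, max_iter=100):
    """Newton iterations for the Poisson QMLE with mean exp(X @ beta)."""
    beta = np.zeros(X.shape[1])
    beta[0] = np.log(y.mean())          # assumes the first column of X is ones
    for _ in range(max_iter):
        m = np.exp(X @ beta)
        step = np.linalg.solve(X.T @ (X * m[:, None]), X.T @ (y - m))
        beta += step
        if np.max(np.abs(step)) < tol:
            break
    return beta

# Simulated overdispersed counts (hypothetical): Poisson with gamma heterogeneity.
rng = np.random.default_rng(2)
N = 5000
x1 = rng.normal(size=N)
X = np.column_stack([np.ones(N), x1])
c = rng.gamma(shape=2.0, scale=0.5, size=N)          # unit mean, variance 0.5
y = rng.poisson(c * np.exp(0.1 + 0.4 * x1)).astype(float)

beta_hat = fit_poisson_qmle(X, y)
m_hat = np.exp(X @ beta_hat)
u_hat = y - m_hat

A_sum = X.T @ (X * m_hat[:, None])                   # sum of A_i-hat
A_inv = np.linalg.inv(A_sum)
sigma2_hat = np.mean(u_hat**2 / m_hat)               # equation (18.15)

se_poisson = np.sqrt(np.diag(A_inv))                 # (18.16) with sigma2 set to 1
se_glm = np.sqrt(sigma2_hat) * se_poisson            # (18.16)
scores = X * u_hat[:, None]                          # rows are s_i-hat'
se_robust = np.sqrt(np.diag(A_inv @ (scores.T @ scores) @ A_inv))  # (18.14)
```

In this design the variance is not proportional to the mean, so the robust standard errors are the ones with a formal justification; the GLM standard errors are shown for comparison.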

If y given x has a Poisson distribution, it follows from the general efficiency of the 
conditional MLE—see Section 14.5.2—that the Poisson QMLE is fully efficient in 
the class of estimators that ignores information on the marginal distribution of x. 

A nice property of the Poisson QMLE is that it retains some efficiency for certain 
departures from the Poisson assumption. The efficiency results of GMT (1984a) can 
be applied here: if the GLM assumption (18.3) holds for some $\sigma^2 > 0$, the Poisson
QMLE is efficient in the class of all QMLEs in the linear exponential family of distributions.
In particular, the Poisson QMLE is more efficient than the nonlinear least
squares estimator (NLSE), as well as many other QMLEs in the LEF, some of which 
we cover in Sections 18.3 and 18.4. 

Wooldridge (1997c) gives an example of Poisson regression to an economic model 
of crime, where the response variable is number of arrests of a young man living in 
California during 1986. Wooldridge finds overdispersion: $\hat{\sigma}$ is either 1.228 or 1.172,
depending on the functional form for the conditional mean. The following example 
shows that underdispersion is possible. 


Example 18.1 (Effects of Education on Fertility): We use the data in FERTIL2.RAW
to estimate the effects of education on women's fertility in Botswana. The response
variable, children, is the number of living children. We use a standard exponential
regression function, and the explanatory variables are years of schooling
(educ), a quadratic in age, and binary indicators for ever married, living in an urban
area, having electricity, and owning a television. The results are given in Table
18.1. A linear regression model is also included, with the usual OLS standard
errors. For Poisson regression, the standard errors are the GLM standard errors.
All of the coefficients are statistically significant at low significance levels. (You are
invited to compute the fully robust standard errors in each case. Interestingly, the
heteroskedasticity-robust standard errors for OLS do not differ substantively from
the usual OLS standard errors, and the fully robust standard errors for the Poisson
QMLE are similar to the GLM standard errors.)

Table 18.1
OLS and Poisson Estimates of a Fertility Equation

Dependent Variable: children

Independent Variable      Linear (OLS)    Exponential (Poisson QMLE)
educ                      -.0644          -.0217
                          (.0063)         (.0025)
age                       .272            .337
                          (.017)          (.009)
age^2                     -.0019          -.0041
                          (.0003)         (.0001)
evermarr                  .682            .315
                          (.052)          (.021)
urban                     -.228           -.086
                          (.046)          (.019)
electric                  -.262           -.121
                          (.076)          (.034)
tv                        -.250           -.145
                          (.090)          (.041)
constant                  -3.394          -5.375
                          (.245)          (.141)
Log-likelihood value      n/a             -6,497.060
R-squared                 .590            .598
$\hat{\sigma}$                    1.424           .867
Number of observations    4,358           4,358

Not surprisingly, the signs of the coefficients are the same in the linear and expo- 
nential models, but their interpretations differ. For example, the coefficient on educ 
in the linear model implies that each year of education reduces the predicted number 
of children by about .064. So, if 100 women each get another year of education, we
estimate about six fewer children among them. The coefficient on educ in the exponential
model implies that each year of education is estimated to reduce the expected
number of children by 2.2 percent. To make the exponential coefficients roughly
comparable to the linear model coefficients, we can multiply the former by the sample
average of the dependent variable, $\overline{children} = 2.268$, to obtain the APE. For educ,
the APE is $2.268(-.0217) = -.0492$, which is somewhat smaller in magnitude than
the linear model estimate. For the binary variable tv, we compute the predicted value
for each woman by setting tv equal to one and zero and taking the difference. The
average difference across all women is about $-.309$, which means that the
average effect of television ownership is almost one-third of a child less; this estimate
is somewhat higher in magnitude than the linear model estimate.
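
The arithmetic behind these comparisons is easy to reproduce. The sketch below uses only the educ coefficients and the sample mean of children reported in this example; the variable names are made up for illustration. The exact percentage effect, $100[\exp(\hat{\beta}_{educ}) - 1]$, is included alongside the $100\hat{\beta}_{educ}$ approximation for comparison.

```python
import math

ybar = 2.268             # sample average of children, from the example
b_educ_poisson = -0.0217 # Poisson QMLE coefficient on educ (Table 18.1)
b_educ_ols = -0.0644     # OLS coefficient on educ (Table 18.1)

ape_educ = ybar * b_educ_poisson                   # APE shortcut: ybar * beta_hat
pct_approx = 100 * b_educ_poisson                  # semielasticity approximation
pct_exact = 100 * (math.exp(b_educ_poisson) - 1)   # exact proportionate effect
children_per_100_women = 100 * b_educ_ols          # linear model, 100 women
```

For a coefficient this small the approximate and exact percentage effects differ only in the second decimal place.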

The estimate of $\sigma$ in the Poisson regression implies underdispersion: the variance is
less than the mean. (Incidentally, the $\hat{\sigma}$'s for the linear and Poisson models are not
comparable.) One implication is that the GLM standard errors are actually less than
the corresponding Poisson MLE standard errors.

For the linear model, the R-squared is the usual one. For the exponential model,
the R-squared is computed as the squared correlation coefficient between $children_i$
and $\widehat{children}_i = \exp(x_i\hat{\beta})$. The exponential regression function fits only slightly better.


18.2.4 Hypothesis Testing


Classical hypothesis testing is fairly straightforward in a QMLE setting. Testing 
hypotheses about individual parameters is easily carried out using asymptotic $t$ statistics
after computing the appropriate standard error, as we discussed in Section
18.2.3. Multiple hypothesis tests can be carried out using the Wald, quasi-likelihood
ratio (QLR), or score test. We covered these generally in Sections 12.6 and 13.6, and 
they apply immediately to the Poisson QMLE. 

The Wald statistic for testing nonlinear hypotheses is computed as in equation 
(12.63), where $\hat{V}$ is chosen appropriately depending on the degree of robustness
desired, with expression (18.14) being the most robust. The Wald statistic is convenient
for testing multiple exclusion restrictions in a robust fashion.

When the GLM assumption (18.3) holds, the QLR statistic can be used. Let $\tilde{\beta}$ be
the restricted estimator, where $Q$ restrictions of the form $c(\beta) = 0$ have been
imposed. Let $\hat{\beta}$ be the unrestricted QMLE. Let $\mathcal{L}(\beta)$ be the quasi-log likelihood for
the sample of size $N$, given in expression (18.6). Let $\hat{\sigma}^2$ be given in equation (18.15)
(with or without the degrees-of-freedom adjustment), where the $\hat{u}_i$ are the residuals
from the unconstrained maximization. The QLR statistic,

$$QLR \equiv 2[\mathcal{L}(\hat{\beta}) - \mathcal{L}(\tilde{\beta})]/\hat{\sigma}^2, \tag{18.17}$$

converges in distribution to $\chi_Q^2$ under $H_0$, under the conditions laid out in Section
12.6.3. The division of the usual likelihood ratio statistic by $\hat{\sigma}^2$ provides for some
degree of robustness. If we set $\hat{\sigma}^2 = 1$, we obtain the usual LR statistic, which is valid
only under assumption (18.2). There is no usable quasi-LR statistic when the GLM
assumption (18.3) does not hold.
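
A minimal sketch of the QLR calculation for a single exclusion restriction follows. It fits the restricted and unrestricted exponential-mean models, evaluates the quasi-log likelihood at each, and scales the difference by $\hat{\sigma}^2$ from the unrestricted fit. The simulated data (in which the excluded regressor truly matters), the function names, and the use of 3.841 as the 5 percent critical value of a $\chi_1^2$ distribution are choices made for this illustration.

```python
import numpy as np

def fit_poisson_qmle(X, y, tol=1e-10, max_iter=100):
    """Newton iterations for the Poisson QMLE with mean exp(X @ beta)."""
    beta = np.zeros(X.shape[1])
    beta[0] = np.log(y.mean())          # assumes the first column of X is ones
    for _ in range(max_iter):
        m = np.exp(X @ beta)
        step = np.linalg.solve(X.T @ (X * m[:, None]), X.T @ (y - m))
        beta += step
        if np.max(np.abs(step)) < tol:
            break
    return beta

def quasi_loglik(beta, X, y):
    """Quasi-log likelihood for the exponential mean, dropping log(y_i!)."""
    xb = X @ beta
    return float(np.sum(y * xb - np.exp(xb)))

# Simulated data (hypothetical): x2 has a nonzero coefficient, so H0 is false.
rng = np.random.default_rng(3)
N = 2000
x1, x2 = rng.normal(size=N), rng.normal(size=N)
y = rng.poisson(np.exp(0.1 + 0.3 * x1 + 0.5 * x2)).astype(float)

X_u = np.column_stack([np.ones(N), x1, x2])   # unrestricted model
X_r = X_u[:, :2]                              # restricted model: x2 excluded

b_u, b_r = fit_poisson_qmle(X_u, y), fit_poisson_qmle(X_r, y)
m_u = np.exp(X_u @ b_u)
sigma2_hat = np.mean((y - m_u) ** 2 / m_u)    # (18.15), unrestricted residuals

qlr = 2.0 * (quasi_loglik(b_u, X_u, y) - quasi_loglik(b_r, X_r, y)) / sigma2_hat
reject_at_5pct = qlr > 3.841                  # chi-square(1) 5% critical value
```

Setting `sigma2_hat` to one would give the usual LR statistic, which is appropriate only under the Poisson variance assumption.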

The score test can also be used to test multiple hypotheses. In this case we estimate 
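only the restricted model. Partition $\beta$ as $(\alpha', \gamma')'$, where $\alpha$ is $P_1 \times 1$ and $\gamma$ is $P_2 \times 1$,
and assume that the null hypothesis is

$$H_0: \gamma_o = \bar{\gamma}, \tag{18.18}$$

where $\bar{\gamma}$ is a $P_2 \times 1$ vector of specified constants (often, $\bar{\gamma} = 0$). Let $\breve{\beta}$ be the estimator
of $\beta$ obtained under the restriction $\gamma = \bar{\gamma}$ [so $\breve{\beta} \equiv (\breve{\alpha}', \bar{\gamma}')'$], and define quantities under
the restricted estimation as $\breve{m}_i \equiv m(x_i, \breve{\beta})$, $\breve{u}_i \equiv y_i - \breve{m}_i$, and $\nabla_\beta \breve{m}_i \equiv (\nabla_\alpha \breve{m}_i, \nabla_\gamma \breve{m}_i) \equiv
\nabla_\beta m(x_i, \breve{\beta})$. Now weight the residuals and gradient by the inverse of the nominal Poisson
standard deviation, estimated under the null, $1/\sqrt{\breve{m}_i}$:

$$\tilde{u}_i \equiv \breve{u}_i/\sqrt{\breve{m}_i}, \qquad \nabla_\beta \tilde{m}_i \equiv \nabla_\beta \breve{m}_i/\sqrt{\breve{m}_i}, \tag{18.19}$$

so that the $\tilde{u}_i$ here are the Pearson residuals obtained under the null. A form of the
score statistic that is valid under the GLM assumption (18.3) (and therefore under
assumption (18.2)) is $N R_u^2$ from the regression

$$\tilde{u}_i \ \text{on} \ \nabla_\beta \tilde{m}_i, \qquad i = 1, 2, \ldots, N, \tag{18.20}$$

where $R_u^2$ denotes the uncentered R-squared. Under $H_0$ and assumption (18.3),
$N R_u^2 \stackrel{a}{\sim} \chi_{P_2}^2$. This is identical to the score statistic in equation (12.68) but where we
use $\tilde{B} = \tilde{\sigma}^2 \tilde{A}$, where the notation is self-explanatory. For more, see Wooldridge
(1991a, 1997c).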

Following our development for nonlinear regression in Section 12.6.2, it is easy to
obtain a test that is completely robust to variance misspecification. Let $\tilde{r}_i$ denote the
$1 \times P_2$ residuals from the regression

$$\nabla_\gamma \tilde{m}_i \ \text{on} \ \nabla_\alpha \tilde{m}_i. \tag{18.21}$$

In other words, regress each element of the weighted gradient with respect to the
restricted parameters on the weighted gradient with respect to the unrestricted
parameters. The residuals are put into the $1 \times P_2$ vector $\tilde{r}_i$. The robust score statistic
is obtained as $N - \mathrm{SSR}$ from the regression

$$1 \ \text{on} \ \tilde{u}_i \tilde{r}_i, \qquad i = 1, 2, \ldots, N, \tag{18.22}$$

where $\tilde{u}_i \tilde{r}_i = (\tilde{u}_i \tilde{r}_{i1}, \tilde{u}_i \tilde{r}_{i2}, \ldots, \tilde{u}_i \tilde{r}_{iP_2})$ is a $1 \times P_2$ vector. Alternatively, we can regress $\tilde{u}_i$
on $\tilde{r}_i$ and use a heteroskedasticity-robust Wald statistic for joint significance of $\tilde{r}_i$.

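As an example, consider testing $H_0: \gamma = 0$ in the exponential model $E(y \mid x) =
\exp(x\beta) = \exp(x_1\alpha + x_2\gamma)$. Then $\nabla_\beta m(x, \beta) = \exp(x\beta)x$. Let $\breve{\alpha}$ be the Poisson QMLE
obtained under $\gamma = 0$, and define $\breve{m}_i \equiv \exp(x_{i1}\breve{\alpha})$, with $\breve{u}_i$ the residuals. Now $\nabla_\alpha \breve{m}_i =
\exp(x_{i1}\breve{\alpha})x_{i1}$, $\nabla_\gamma \breve{m}_i = \exp(x_{i1}\breve{\alpha})x_{i2}$, and $\nabla_\beta \tilde{m}_i = \breve{m}_i x_i/\sqrt{\breve{m}_i} = \sqrt{\breve{m}_i}\,x_i$. Therefore, the
test that is valid under the GLM variance assumption is $N R_u^2$ from the OLS regression
$\tilde{u}_i$ on $\sqrt{\breve{m}_i}\,x_i$, where the $\tilde{u}_i$ are the weighted residuals. For the robust test, first
obtain the $1 \times P_2$ residuals $\tilde{r}_i$ from the regression $\sqrt{\breve{m}_i}\,x_{i2}$ on $\sqrt{\breve{m}_i}\,x_{i1}$; then obtain the
statistic from regression (18.22).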
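
Both score statistics for the exponential model can be sketched in a few lines. The code below estimates only the restricted model, forms the Pearson residuals and the weighted gradient $\sqrt{m_i}\,x_i$ under the null, and then computes $N R_u^2$ from regression (18.20) and $N - \mathrm{SSR}$ from regression (18.22). The simulated data (with the excluded regressor truly relevant), the helper `ols_resid`, and the critical value 3.841 for $\chi_1^2$ at the 5 percent level are assumptions made for illustration.

```python
import numpy as np

def fit_poisson_qmle(X, y, tol=1e-10, max_iter=100):
    """Newton iterations for the Poisson QMLE with mean exp(X @ beta)."""
    beta = np.zeros(X.shape[1])
    beta[0] = np.log(y.mean())          # assumes the first column of X is ones
    for _ in range(max_iter):
        m = np.exp(X @ beta)
        step = np.linalg.solve(X.T @ (X * m[:, None]), X.T @ (y - m))
        beta += step
        if np.max(np.abs(step)) < tol:
            break
    return beta

def ols_resid(Z, w):
    """Residuals from an OLS regression of w on Z (w may have several columns)."""
    return w - Z @ np.linalg.lstsq(Z, w, rcond=None)[0]

rng = np.random.default_rng(4)
N = 3000
x1, x2 = rng.normal(size=N), rng.normal(size=N)
y = rng.poisson(np.exp(0.1 + 0.3 * x1 + 0.3 * x2)).astype(float)
X1 = np.column_stack([np.ones(N), x1])        # regressors kept under H0
X2 = x2[:, None]                              # regressor excluded under H0

alpha = fit_poisson_qmle(X1, y)               # restricted estimates only
m = np.exp(X1 @ alpha)
u_w = (y - m) / np.sqrt(m)                    # Pearson residuals under H0
G1 = np.sqrt(m)[:, None] * X1                 # weighted gradient w.r.t. alpha
G2 = np.sqrt(m)[:, None] * X2                 # weighted gradient w.r.t. gamma

# GLM-valid statistic: N times the uncentered R-squared from (18.20).
e = ols_resid(np.hstack([G1, G2]), u_w)
lm_glm = N * (1.0 - (e @ e) / (u_w @ u_w))

# Fully robust statistic: N - SSR from regressing 1 on u_w * r, as in (18.22).
r = ols_resid(G1, G2)
e1 = ols_resid(u_w[:, None] * r, np.ones(N))
lm_robust = N - e1 @ e1
```

Neither statistic requires estimating the unrestricted model, which is the main practical appeal of the score approach.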


18.2.5 Specification Testing 


Various specification tests have been proposed in the context of Poisson regression. 
The two most important kinds are conditional mean specification tests and condi- 
tional variance specification tests. For conditional mean tests, we usually begin with 
a fairly simple model whose parameters are easy to interpret, such as $m(x, \beta) =
\exp(x\beta)$, and then test this against other alternatives. Once the set of conditioning
variables x has been specified, all such tests are functional form tests. 

A useful class of functional form tests can be obtained using the score principle, 
where the null model $m(x, \beta)$ is nested in a more general model. Fully robust tests
and less robust tests are obtained exactly as in the previous section. Wooldridge 
(1997c, Section 3.5) contains details and some examples, including an extension of 
RESET to exponential regression models. 

Conditional variance tests are more difficult to compute, especially if we want to 
maintain only that the first two moments are correctly specified under $H_0$. For example,
it is very natural to test the GLM assumption (18.3) as a way of determining
whether the Poisson QMLE is efficient in the class of estimators using only assump- 
tion (18.7). Cameron and Trivedi (1986) propose tests of the stronger assumption 
(18.2) and, in fact, take the null to be that the Poisson distribution is correct in its 
entirety. These tests are useful if we are interested in whether y given x truly has a 
Poisson distribution. However, assumption (18.2) is not necessary for consistency or 
relative efficiency of the Poisson QMLE. 

Wooldridge (1991b) proposes fully robust tests of conditional variances in the 
context of the linear exponential family, which contains Poisson regression as a special
case. To test assumption (18.3), write $u_i = y_i - m(x_i, \beta_o)$ and note that, under
assumptions (18.3) and (18.7), $u_i^2 - \sigma_o^2 m(x_i, \beta_o)$ is uncorrelated with any function of
$x_i$. Let $h(x_i, \beta)$ be a $1 \times Q$ vector of functions of $x_i$ and $\beta$, and consider the alternative
model

$$E(u_i^2 \mid x_i) = \sigma_o^2 m(x_i, \beta_o) + h(x_i, \beta_o)\delta_o. \tag{18.23}$$

For example, the elements of $h(x_i, \beta)$ can be powers of $m(x_i, \beta)$. Popular choices are
unity and $\{m(x_i, \beta)\}^2$. A test of $H_0: \delta_o = 0$ is then a test of the GLM assumption.
While there are several moment conditions that can be used, a fruitful one is to use 
the weighted residuals, as we did with the conditional mean tests. We base the test on 


$$N^{-1} \sum_{i=1}^{N} (\hat{h}_i/\hat{m}_i)'\{(\hat{u}_i^2 - \hat{\sigma}^2 \hat{m}_i)/\hat{m}_i\} = N^{-1} \sum_{i=1}^{N} \tilde{h}_i'(\tilde{u}_i^2 - \hat{\sigma}^2), \tag{18.24}$$

where $\tilde{h}_i = \hat{h}_i/\hat{m}_i$ and $\tilde{u}_i = \hat{u}_i/\sqrt{\hat{m}_i}$. (Note that $\hat{h}_i$ is weighted by $1/\hat{m}_i$, not $1/\sqrt{\hat{m}_i}$.)
To turn this equation into a test statistic, we must confront the fact that its standardized
limiting distribution depends on the limiting distributions of $\sqrt{N}(\hat{\beta} - \beta_o)$
and $\sqrt{N}(\hat{\sigma}^2 - \sigma_o^2)$. To handle this problem, we use a trick suggested by Wooldridge
(1991b) that removes the dependence of the limiting distribution of the test statistic
on that of $\sqrt{N}(\hat{\sigma}^2 - \sigma_o^2)$: replace $\tilde{h}_i$ in equation (18.24) with its demeaned counterpart,
$\tilde{r}_i \equiv \tilde{h}_i - \bar{h}$, where $\bar{h}$ is just the $1 \times Q$ vector of sample averages of each element
of $\tilde{h}_i$. There is an additional purging that then leads to a simple regression-based
statistic. Let $\nabla_\beta \hat{m}_i$ be the unweighted gradient of the conditional mean function,
evaluated at the Poisson QMLE $\hat{\beta}$, and define $\nabla_\beta \tilde{m}_i \equiv \nabla_\beta \hat{m}_i/\sqrt{\hat{m}_i}$, as before. The following
steps come from Wooldridge (1991b, Procedure 4.1):


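1. Obtain $\hat{\sigma}^2$ as in equation (18.15) and $\hat{A}$ as in equation (18.16), and define the
$P \times Q$ matrix $\hat{J} = \hat{\sigma}^2 \left(N^{-1} \sum_{i=1}^{N} \nabla_\beta \hat{m}_i' \tilde{r}_i/\hat{m}_i\right)$.

2. For each $i$, define the $1 \times Q$ vector

$$\hat{z}_i \equiv (\tilde{u}_i^2 - \hat{\sigma}^2)\tilde{r}_i - \hat{s}_i' \hat{A}^{-1} \hat{J}, \tag{18.25}$$

where $\hat{s}_i \equiv \nabla_\beta \tilde{m}_i' \tilde{u}_i$ is the Poisson score for observation $i$.

3. Run the regression

$$1 \ \text{on} \ \hat{z}_i, \qquad i = 1, 2, \ldots, N. \tag{18.26}$$

Under assumptions (18.3) and (18.7), $N - \mathrm{SSR}$ from this regression is distributed
asymptotically as $\chi_Q^2$.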


The leading case occurs when $\hat{m}_i = \exp(x_i\hat{\beta})$ and $\nabla_\beta \hat{m}_i = \exp(x_i\hat{\beta})x_i = \hat{m}_i x_i$. The
subtraction of $\hat{s}_i' \hat{A}^{-1} \hat{J}$ in equation (18.25) is a simple way of handling the fact that the
limiting distribution of $\sqrt{N}(\hat{\beta} - \beta_o)$ affects the limiting distribution of the unadjusted
statistic in equation (18.24). This particular adjustment ensures that the tests are just
as efficient as any maximum-likelihood-based statistic if $\sigma_o^2 = 1$ and the Poisson
assumption is correct. But this procedure is fully robust in the sense that only
assumptions (18.3) and (18.7) are maintained under $H_0$. For further discussion the
reader is referred to Wooldridge (1991b).

In practice, it is probably sufficient to choose $Q$, the number of elements in $h$, to be
small. Setting $\hat{h}_i = (1, \hat{m}_i^2)$, so that $\tilde{h}_i = (1/\hat{m}_i, \hat{m}_i)$, is likely to produce a fairly powerful
two-degrees-of-freedom test against a fairly broad class of alternatives.
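
The three-step procedure can be sketched for the exponential mean with $\hat{h}_i = (1, \hat{m}_i^2)$. With $\nabla_\beta \hat{m}_i = \hat{m}_i x_i$, the matrix $\hat{J}$ reduces to $\hat{\sigma}^2 N^{-1} \sum_i x_i' \tilde{r}_i$ and the score is $\hat{s}_i = x_i' \hat{u}_i$. The simulated Poisson data (so the null holds), the variable names, and the sample size are assumptions made for illustration; this is one reading of the steps above, not a definitive implementation.

```python
import numpy as np

def fit_poisson_qmle(X, y, tol=1e-10, max_iter=100):
    """Newton iterations for the Poisson QMLE with mean exp(X @ beta)."""
    beta = np.zeros(X.shape[1])
    beta[0] = np.log(y.mean())          # assumes the first column of X is ones
    for _ in range(max_iter):
        m = np.exp(X @ beta)
        step = np.linalg.solve(X.T @ (X * m[:, None]), X.T @ (y - m))
        beta += step
        if np.max(np.abs(step)) < tol:
            break
    return beta

rng = np.random.default_rng(5)
N = 4000
x1 = rng.normal(size=N)
X = np.column_stack([np.ones(N), x1])
y = rng.poisson(np.exp(0.2 + 0.4 * x1)).astype(float)   # null (18.3) holds here

beta_hat = fit_poisson_qmle(X, y)
m = np.exp(X @ beta_hat)
u_w = (y - m) / np.sqrt(m)                   # Pearson residuals
sigma2 = np.mean(u_w ** 2)                   # equation (18.15)

h_w = np.column_stack([1.0 / m, m])          # h_i = (1, m_i^2) weighted by 1/m_i
r = h_w - h_w.mean(axis=0)                   # demeaned auxiliary regressors

# Step 1: A-hat and J-hat (sample averages).
A = X.T @ (X * m[:, None]) / N
J = sigma2 * (X.T @ r) / N                   # grad m_i' / m_i = x_i' here

# Step 2: z_i = (u_w_i^2 - sigma2) r_i - s_i' A^{-1} J, with s_i' = x_i * u_i.
S = X * (y - m)[:, None]
Z = (u_w ** 2 - sigma2)[:, None] * r - S @ np.linalg.solve(A, J)

# Step 3: regress 1 on z_i; the statistic is N - SSR (chi-square, Q = 2 df).
ones = np.ones(N)
resid = ones - Z @ np.linalg.lstsq(Z, ones, rcond=None)[0]
stat = N - resid @ resid
```

The statistic would be compared with a $\chi_2^2$ critical value; to test (18.2) instead, set `sigma2` to one and use `h_w` in place of `r`.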

The procedure is easily modified to test the more restrictive assumption (18.2). First,
replace $\hat{\sigma}^2$ everywhere with unity. Second, there is no need to demean the auxiliary
regressors $\tilde{h}_i$ (so that now $\tilde{h}_i$ can contain a constant); thus, wherever $\tilde{r}_i$ appears, simply
use $\tilde{h}_i$. Everything else is the same. For the reasons discussed earlier, when the
focus is on $E(y \mid x)$, we are more interested in testing assumption (18.3) than
assumption (18.2).


18.3 Other Count Data Regression Models 


18.3.1 Negative Binomial Regression Models 


The Poisson regression model nominally maintains assumption (18.2) but retains some asymptotic efficiency under assumption (18.3). A popular alternative to the Poisson QMLE is full maximum likelihood analysis of the NegBin I model of Cameron and Trivedi (1986). NegBin I is a particular parameterization of the negative binomial distribution. An important restriction in the NegBin I model is that it implies assumption (18.3) with $\sigma^2 > 1$, so that there cannot be underdispersion. (We drop the “o” subscript in this section for notational simplicity.) Typically, NegBin I is parameterized through the mean parameters $\beta$ and an additional parameter, $\eta^2 > 0$, where $\sigma^2 = 1 + \eta^2$.

What are the merits of using NegBin I? On the one hand, when $\beta$ and $\eta^2$ are estimated jointly, the MLEs are generally inconsistent if the NegBin I assumption fails. On the other hand, if the NegBin I distribution holds, then the NegBin I MLE is more efficient than the Poisson QMLE (this conclusion follows from Section 14.5.2). Still, under assumption (18.3), the Poisson QMLE is more efficient than any other estimator that requires only the conditional mean to be correctly specified for consistency. On balance, because of its robustness, the Poisson QMLE has the edge over NegBin I for estimating the parameters of the conditional mean. If conditional probabilities need to be estimated, then a more flexible model than the Poisson distribution is probably warranted.

Other count data distributions imply a conditional variance other than assumption 
(18.3). A leading example is the NegBin II model of Cameron and Trivedi (1986). 
The NegBin II model can be derived from a model of unobserved heterogeneity in a 
Poisson model. Specifically, let $c_i > 0$ be unobserved heterogeneity, and assume that

$$y_i \mid x_i, c_i \sim \text{Poisson}[c_i m(x_i, \beta)].$$

If we further assume that $c_i$ is independent of $x_i$ and has a gamma distribution with unit mean and $\mathrm{Var}(c_i) = \eta^2$, then the distribution of $y_i$ given $x_i$ can be shown to be negative binomial, with conditional mean and variance

$$E(y_i \mid x_i) = m(x_i, \beta), \tag{18.27}$$

$$\begin{aligned} \mathrm{Var}(y_i \mid x_i) &= E[\mathrm{Var}(y_i \mid x_i, c_i) \mid x_i] + \mathrm{Var}[E(y_i \mid x_i, c_i) \mid x_i] \\ &= m(x_i, \beta) + \eta^2[m(x_i, \beta)]^2, \end{aligned} \tag{18.28}$$

so that the conditional variance of $y_i$ given $x_i$ is a quadratic in the conditional mean. Because we can write equation (18.28) as $E(y_i \mid x_i)[1 + \eta^2 E(y_i \mid x_i)]$, NegBin II also implies overdispersion, but where the amount of overdispersion increases with $E(y_i \mid x_i)$.
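
The log-likelihood function for observation $i$ is

$$\ell_i(\beta, \eta^2) = \eta^{-2}\log\left[\frac{\eta^{-2}}{\eta^{-2} + m(x_i, \beta)}\right] + y_i\log\left[\frac{m(x_i, \beta)}{\eta^{-2} + m(x_i, \beta)}\right] + \log[\Gamma(y_i + \eta^{-2})/\Gamma(\eta^{-2})], \tag{18.29}$$

where $\Gamma(\cdot)$ is the gamma function defined for $r > 0$ by $\Gamma(r) = \int_0^\infty z^{r-1}\exp(-z)\,dz$.

You are referred to Cameron and Trivedi (1986) for details. The parameters $\beta$ and $\eta^2$ can be jointly estimated using standard maximum likelihood methods.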
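
To make (18.28) and (18.29) concrete, here is a minimal Python sketch of the NegBin II log likelihood for a single observation, together with the implied conditional variance. The function names are illustrative and do not come from any package. As in (18.29), the term $-\log(y_i!)$ is omitted because it does not depend on the parameters.

```python
import math


def negbin2_loglik(y, m, eta2):
    """Log likelihood (18.29) for one observation; omits the -log(y!) term."""
    a = 1.0 / eta2  # this is eta^{-2}
    return (
        a * math.log(a / (a + m))
        + y * math.log(m / (a + m))
        + math.lgamma(y + a)
        - math.lgamma(a)
    )


def negbin2_variance(m, eta2):
    """Conditional variance (18.28): a quadratic in the conditional mean m."""
    return m + eta2 * m ** 2
```

Adding back $-\log(y!)$ and exponentiating gives a probability function; summing it over $y = 0, 1, 2, \ldots$ is one way to check numerically that the mean is $m$ and the variance is $m + \eta^2 m^2$.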

It turns out that, for fixed $\eta^2$, the log likelihood in equation (18.29) is in the linear exponential family; see GMT (1984a). Therefore, if we fix $\eta^2$ at any positive value, say $\bar{\eta}^2$, and estimate $\beta$ by maximizing $\sum_{i=1}^N \ell_i(\beta, \bar{\eta}^2)$ with respect to $\beta$, then the resulting QMLE is consistent under the conditional mean assumption (18.27) only: for fixed $\eta^2$, the negative binomial QMLE has the same robustness properties as the Poisson QMLE. (Notice that when $\eta^2$ is fixed, the term involving the gamma function in equation (18.29) does not affect the QMLE.)

The structure of the asymptotic variance estimators and test statistics is very similar to the Poisson regression case. Let


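$$\hat{v}_i = \hat{m}_i + \bar{\eta}^2\hat{m}_i^2 \tag{18.30}$$

be the estimated nominal variance for the given value $\bar{\eta}^2$. We simply weight the residuals $\hat{u}_i$ and gradient $\nabla_\beta \hat{m}_i$ by $1/\sqrt{\hat{v}_i}$:

$$\tilde{u}_i = \hat{u}_i/\sqrt{\hat{v}_i}, \qquad \nabla_\beta \tilde{m}_i = \nabla_\beta \hat{m}_i/\sqrt{\hat{v}_i}. \tag{18.31}$$

For example, under conditions (18.27) and (18.28), a valid estimator of $\mathrm{Avar}(\hat{\beta})$ is

$$\left(\sum_{i=1}^N \nabla_\beta \hat{m}_i'\nabla_\beta \hat{m}_i/\hat{v}_i\right)^{-1}.$$

If we drop condition (18.28), the estimator in expression (18.14) should be used but with the standardized residuals and gradients given by equation (18.31). Score statistics are modified in the same way.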

When $\bar{\eta}^2$ is set to unity, we obtain the geometric QMLE. A better approach is to replace $\bar{\eta}^2$ by a first-stage estimate, say $\hat{\eta}^2$, and then estimate $\beta$ by two-step QMLE. As we discussed in Chapters 12 and 13, sometimes the asymptotic distribution of the first-stage estimator needs to be taken into account. A nice feature of the two-step QMLE in this context is that the key condition, assumption (12.37), can be shown to hold under assumption (18.27). Therefore, we can ignore the first-stage estimation of $\hat{\eta}^2$.

Under assumption (18.28), a consistent estimator of $\eta^2$ is easy to obtain, given an initial estimator of $\beta$ (such as the Poisson QMLE or the geometric QMLE). Given $\hat{\beta}$, form $\hat{m}_i$ and $\hat{u}_i$ as the usual fitted values and residuals. One consistent estimator of $\eta^2$ is the coefficient on $\hat{m}_i^2$ in the regression (through the origin) of $\hat{u}_i^2 - \hat{m}_i$ on $\hat{m}_i^2$; this is the estimator suggested by Gourieroux, Monfort, and Trognon (1984b) and Cameron and Trivedi (1986). An alternative estimator of $\eta^2$, which is closely related to the GLM estimator of $\sigma^2$ suggested in equation (18.15), is a weighted least squares (WLS) estimate, which can be obtained from the OLS regression $\tilde{u}_i^2 - 1$ on $\hat{m}_i$, where the $\tilde{u}_i$ are residuals $\hat{u}_i$ weighted by $\hat{m}_i^{-1/2}$. The resulting two-step estimator of $\beta$ is consistent under assumption (18.7) only, so it is just as robust as the Poisson QMLE. Because $\hat{\beta}$ is consistent without (18.28), it makes sense to use fully robust standard errors and test statistics. If assumption (18.3) holds, the Poisson QMLE is asymptotically more efficient; if assumption (18.28) holds, the two-step negative binomial estimator is more efficient. Notice that neither variance assumption contains the other as a special case for all parameter values; see Wooldridge (1997c) for additional discussion.

The variance specification tests discussed in Section 18.2.5 can be extended to the negative binomial QMLE; see Wooldridge (1991b).
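
The two first-stage estimators of $\eta^2$ described in this subsection are simple enough to compute by hand. The Python sketch below is an illustration only: it takes fitted values and residuals from some initial estimator as given, the function names are hypothetical, and the second regression is run without an intercept, which is our assumption about how the WLS version is implemented.

```python
def eta2_through_origin(u_hat, m_hat):
    """Coefficient from regressing u_hat^2 - m_hat on m_hat^2, no intercept."""
    num = sum((u * u - m) * m * m for u, m in zip(u_hat, m_hat))
    den = sum(m ** 4 for m in m_hat)
    return num / den


def eta2_weighted(u_hat, m_hat):
    """Coefficient from regressing u_tilde^2 - 1 on m_hat, no intercept,
    where u_tilde = u_hat / sqrt(m_hat)."""
    num = sum((u * u / m - 1.0) * m for u, m in zip(u_hat, m_hat))
    den = sum(m * m for m in m_hat)
    return num / den
```

Either estimate can be plugged in for the fixed value in (18.30) before computing the two-step negative binomial QMLE.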


18.3.2 Binomial Regression Models 


Sometimes we wish to analyze count data conditional on a known upper bound. For 
example, Thomas, Strauss, and Henriques (1990) study child mortality within families 
conditional on number of children ever born. Another example takes the dependent variable, $y_i$, to be the number of adult children in family $i$ who are high school graduates; the known upper bound, $n_i$, is the number of children in family $i$. By conditioning on $n_i$ we are, presumably, treating it as exogenous.

Let $x_i$ be a set of exogenous variables. A natural starting point is to assume that $y_i$ given $(n_i, x_i)$ has a binomial distribution, denoted $\text{Binomial}[n_i, p(x_i, \beta)]$, where $p(x_i, \beta)$ is a function bounded between zero and one. In this setup, usually, $y_i$ is viewed as the sum of $n_i$ independent Bernoulli (zero-one) random variables, and $p(x_i, \beta)$ is the (conditional) probability of success on each trial.

The binomial assumption is too restrictive to hold in all applications. The presence of an unobserved effect would invalidate the binomial assumption (after the effect is integrated out). For example, when $y_i$ is the number of children in a family graduating from high school, unobserved family effects may play an important role. Generally, the presence of unobserved heterogeneity within group $i$ violates the independence assumption (conditional on $x_i$) that is used to derive the binomial distribution.

As in the case of unbounded support, we assume that the conditional mean is correctly specified:

$$E(y_i \mid x_i, n_i) = n_i p(x_i, \beta) \equiv m_i(\beta). \tag{18.32}$$

This formulation ensures that $E(y_i \mid x_i, n_i)$ is between zero and $n_i$. Typically, $p(x_i, \beta) = G(x_i\beta)$, where $G(\cdot)$ is a cumulative distribution function, such as the standard normal or logistic function.

Given a parametric model $p(x, \beta)$, the binomial quasi-log likelihood for observation $i$ is

$$\ell_i(\beta) = y_i\log[p(x_i, \beta)] + (n_i - y_i)\log[1 - p(x_i, \beta)], \tag{18.33}$$

and the binomial QMLE is obtained by maximizing the sum of $\ell_i(\beta)$ over all $N$ observations. From the results of GMT (1984a), the conditional mean parameters are consistently estimated under assumption (18.32) only. This conclusion follows from the general M-estimation results after showing that the true value of $\beta$ maximizes the expected value of equation (18.33) under assumption (18.32) only.

The binomial GLM variance assumption is

$$\mathrm{Var}(y_i \mid x_i, n_i) = \sigma^2 n_i p(x_i, \beta)[1 - p(x_i, \beta)] \equiv \sigma^2 v_i(\beta), \tag{18.34}$$

which generalizes the nominal binomial assumption with $\sigma^2 = 1$. (McCullagh and Nelder [1989, Section 4.5] discuss a model that leads to assumption (18.34) with $\sigma^2 > 1$. But underdispersion is also possible.) Even the GLM assumption can fail if the binary outcomes comprising $y_i$ are not independent conditional on $(x_i, n_i)$. Therefore, it makes sense to use the fully robust asymptotic variance estimator for the binomial QMLE.
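
The claim that the true parameter maximizes the expected value of (18.33) under the mean assumption alone can be checked numerically in a simple case. In the Python sketch below (an illustration with hypothetical function names, not a general proof), the quasi-log likelihood is linear in $y_i$, so its conditional expectation is obtained by replacing $y_i$ with $n_i p_o$, where $p_o$ denotes the true success probability; no other feature of the distribution is used.

```python
import math


def binomial_quasi_loglik(y, n, p):
    """Quasi-log likelihood (18.33) for one observation with probability p."""
    return y * math.log(p) + (n - y) * math.log(1.0 - p)


def expected_quasi_loglik(p, n, p_true):
    """Expected value of (18.33) when only E(y | x, n) = n * p_true holds."""
    return binomial_quasi_loglik(n * p_true, n, p)


# Search a grid of candidate probabilities for the maximizer.
grid = [k / 100 for k in range(1, 100)]
best_p = max(grid, key=lambda p: expected_quasi_loglik(p, n=10, p_true=0.3))
```

On this grid the maximizer `best_p` coincides with the true value 0.3, whatever the conditional variance of $y_i$ happens to be.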

Owing to the structure of LEF densities, and given our earlier analysis of the Poisson and negative binomial cases, it is straightforward to describe the econometric analysis for the binomial QMLE: simply take $\hat{m}_i = n_i p(x_i, \hat{\beta})$, $\hat{u}_i = y_i - \hat{m}_i$, $\nabla_\beta \hat{m}_i = n_i\nabla_\beta \hat{p}_i$, and $\hat{v}_i = n_i\hat{p}_i(1 - \hat{p}_i)$ in equations (18.31). An estimator of $\sigma^2$ under assumption (18.34) is also easily obtained: replace $\hat{m}_i$ in equation (18.15) with $\hat{v}_i$. The structure of asymptotic variances and score tests is identical.


18.4 Gamma (Exponential) Regression Model 


It is becoming more popular to directly model the expected value of nonnegative, 
continuous response variables rather than using a transformation (usually the natural 
log) and specifying a model linear in parameters with an additive error. Wooldridge 
(1992) makes the case that if E(y|x) is the quantity of interest when y > 0, then it 
makes sense to model this expectation directly. More recently, Blackburn (2007) has 
suggested estimating models of E(wage|x) directly, where wage is a measure of 
employee compensation, rather than using log(wage) in a linear regression. 

We set out the reasons for directly modeling $E(y \mid x)$ in Section 2.2.2. For concreteness, assume that $y > 0$, and so $\log(y)$ is well defined. If we postulate the standard linear model $\log(y) = x\beta + u$, we can ask: when can we recover partial effects, semielasticities, and elasticities on $E(y \mid x)$? A sufficient condition is to assume $u$ and $x$ are independent, in which case $E(y \mid x) = \alpha\exp(x\beta)$, and $\alpha$ is identified from $\alpha = E[\exp(u)]$. Duan’s (1983) smearing estimate is the most common way to estimate $\alpha$ based on OLS residuals from the regression $\log(y_i)$ on $x_i$. Alternatively, we can allow dependence between $u$ and $x$ if we specify $D(u \mid x)$, for example, $u \mid x \sim \text{Normal}(0, \exp(x\gamma))$, in which case $E(y \mid x) = \exp(x\beta + \exp(x\gamma)/2)$.
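
For reference, the smearing estimate of $\alpha = E[\exp(u)]$ is the sample average of the exponentiated OLS residuals. A minimal Python sketch follows; the function names are hypothetical, and the residuals and fitted index are assumed to come from a previously estimated regression of $\log(y_i)$ on $x_i$.

```python
import math


def smearing_factor(residuals):
    """Duan's smearing estimate: the sample average of exp(residual)."""
    return sum(math.exp(u) for u in residuals) / len(residuals)


def predict_mean(x_beta_hat, residuals):
    """Estimate E(y | x) = alpha * exp(x beta) using the smearing factor."""
    return smearing_factor(residuals) * math.exp(x_beta_hat)
```

Because the exponential function is convex, the factor is at least one whenever the residuals average to zero, which is why simply exponentiating the fitted value of $\log(y)$ tends to understate $E(y \mid x)$.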

Both previous approaches require restrictions on the conditional distribution $D(y \mid x)$ in addition to specifying the conditional mean, and each is a roundabout way of obtaining estimates of $E(y \mid x)$. Just as when $y$ has some discreteness (for example, it is a count variable), it makes sense to specify simple, logically consistent models for $E(y \mid x)$. Again, the leading case is an exponential model: $E(y \mid x) = \exp(x\beta)$ with $x_1 = 1$. We already know how to interpret the parameters of such a model.

It is important to remember that, regardless of the nature of $y$, provided it is nonnegative and has no natural upper bound, we can always apply the Poisson QMLE. Thus, even if $y$ is continuous on $(0, \infty)$, Poisson regression delivers consistent, $\sqrt{N}$-asymptotically normal estimators of the parameters in $E(y \mid x) = \exp(x\beta)$ (or any other correctly specified mean function). Even for some continuous random variables the Poisson QMLE can be asymptotically efficient in the class of estimators that specify only the mean. It suffices that assumption (18.3) holds, and this variance-mean relationship holds for certain parameterizations of the gamma distribution.

Nevertheless, a constant variance-mean ratio is somewhat uncommon for nonnegative continuous variables. In the model $\log(y) = x\beta + u$ with $u$ independent of $x$, $\mathrm{Var}(y \mid x) = \sigma^2[\exp(x\beta)]^2$ where $\sigma^2 = \mathrm{Var}[\exp(u)]$. A natural parameterization of the gamma distribution has the variance proportional to the square of the mean. As shown in GMT (1984a), for a fixed value of the ratio of the variance to the squared mean, the log likelihood is in the LEF. In fact, for estimation purposes, we can set the ratio to unity, which gives us the log likelihood for the exponential distribution. With general mean function $m(x, \beta) > 0$, we have

$$\ell_i(\beta) = -y_i/m(x_i, \beta) - \log[m(x_i, \beta)], \tag{18.35}$$


and the gamma QMLE (sometimes called the exponential QMLE) is the estimator that maximizes this quasi-log-likelihood function summed across the sample. It is easy to show directly that the score, $\nabla_\beta m(x_i, \beta)'\{y_i/[m(x_i, \beta)]^2 - 1/m(x_i, \beta)\} = \nabla_\beta m(x_i, \beta)'\{y_i - m(x_i, \beta)\}/[m(x_i, \beta)]^2$, has zero conditional mean (technically, when evaluated at the “true” value of $\beta$) whenever the mean is correctly specified. (Or, one can work with equation (18.35) directly, as in GMT (1984a).) In other words, just as with the Poisson QMLE, the gamma QMLE is fully robust to distributional misspecification other than the conditional mean. Specifying a conditional mean and estimating using the gamma QMLE is often called, as a shorthand, the gamma regression model.
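
The quasi-log likelihood (18.35) and its score are easy to code directly. The Python sketch below uses hypothetical function names and specializes the score to the exponential mean $m(x, \beta) = \exp(x\beta)$, for which $\nabla_\beta m = m x$ and the score reduces to $x'(y - m)/m$.

```python
import math


def gamma_quasi_loglik(y, m):
    """Quasi-log likelihood (18.35) for one observation with mean m > 0."""
    return -y / m - math.log(m)


def gamma_score_exp_mean(y, x, beta):
    """Score for one observation when m = exp(x beta): x' (y - m) / m."""
    m = math.exp(sum(xj * bj for xj, bj in zip(x, beta)))
    return [xj * (y - m) / m for xj in x]
```

In the intercept-only case the summed score is zero when $\exp(\beta)$ equals the sample average of $y_i$, so the gamma QMLE of the mean is the sample average, just as with the Poisson QMLE.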

The gamma QMLE is efficient in the class of estimators that specify only $E(y \mid x)$, including NLS and the Poisson QMLE, under the gamma GLM variance assumption, which we write as

$$\mathrm{Var}(y \mid x) = \sigma^2[E(y \mid x)]^2. \tag{18.36}$$

When $\sigma^2 = 1$, assumption (18.36) gives the variance-mean relationship for the exponential distribution. Under assumption (18.36), $\sigma$ is the coefficient of variation: it is the ratio of the conditional standard deviation of $y$ to its conditional mean.




Whether or not assumption (18.36) holds, an asymptotic variance matrix can be estimated. The fully robust form is expression (18.14), but, in defining the score and expected Hessian, the residuals and gradients are weighted by $1/\hat{m}_i$ rather than $\hat{m}_i^{-1/2}$. Under assumption (18.36), a valid estimator is

$$\hat{\sigma}^2\left(\sum_{i=1}^N \nabla_\beta \hat{m}_i'\nabla_\beta \hat{m}_i/\hat{v}_i\right)^{-1},$$

where $\hat{\sigma}^2 = N^{-1}\sum_{i=1}^N \hat{u}_i^2/\hat{m}_i^2$ and $\hat{v}_i = \hat{m}_i^2$. Score tests and QLR tests can be computed just as in the Poisson case. Many statistical packages implement gamma regression with an exponential mean function, often as a feature of a GLM command.
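
The variance parameter in (18.36) is estimated by averaging squared residuals scaled by squared fitted means. A one-function Python sketch (the name `gamma_glm_sigma2` is hypothetical, and the fitted means are assumed to come from a gamma QMLE already obtained):

```python
def gamma_glm_sigma2(y, m_hat):
    """Estimate sigma^2 in (18.36) as the average of u_hat^2 / m_hat^2."""
    ratios = [((yi - mi) / mi) ** 2 for yi, mi in zip(y, m_hat)]
    return sum(ratios) / len(ratios)
```

The square root of the result estimates the coefficient of variation. A value far from one suggests that the nominal exponential variance is a poor description of the data, although the QMLE of the mean parameters remains consistent.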


18.5 Endogeneity with an Exponential Regression Function 


With all of the previous models, standard econometric problems can arise. In this 
section, we study the problem of endogenous explanatory variables with an expo- 
nential regression function. 

We approach the problem of endogenous explanatory variables from an omitted variables perspective. Let $y_1$ be the nonnegative, in principle unbounded variable to be explained, and let $z$ and $y_2$ be observable explanatory variables (of dimension $1 \times L$ and $1 \times G_1$, respectively). Let $c_1$ be an unobserved latent variable (or unobserved heterogeneity). We assume that the (structural) model of interest is an omitted variables model of exponential form, written in the population as

$$E(y_1 \mid z, y_2, c_1) = \exp(z_1\delta_1 + y_2\gamma_1 + c_1), \tag{18.37}$$

where $z_1$ is a $1 \times L_1$ subset of $z$ containing unity; thus, the model (18.37) incorporates some exclusion restrictions. On the one hand, the elements in $z$ are assumed to be exogenous in the sense that they are independent of $c_1$. On the other hand, $y_2$ and $c_1$ are allowed to be correlated, so that $y_2$ is potentially endogenous.

To use a quasi-likelihood approach, we assume that $y_2$ has a linear reduced form satisfying certain assumptions. Write

$$y_2 = z\Pi_2 + v_2, \tag{18.38}$$

where $\Pi_2$ is an $L \times G_1$ matrix of reduced form parameters and $v_2$ is a $1 \times G_1$ vector of reduced form errors. We assume that the rank condition for identification holds, which requires the order condition $L - L_1 \geq G_1$. In addition, we assume that $(c_1, v_2)$ is independent of $z$, and that

$$c_1 = v_2\rho_1 + e_1, \tag{18.39}$$

where $e_1$ is independent of $v_2$ (and necessarily of $z$). (We could relax the independence assumptions to some degree, but we cannot just assume that $v_2$ is uncorrelated with $z$ and that $e_1$ is uncorrelated with $v_2$.) It is natural to assume that $v_2$ has zero mean, but it is convenient to assume that $E[\exp(e_1)] = 1$ rather than $E(e_1) = 0$. This assumption is without loss of generality whenever a constant appears in $z_1$, which should almost always be the case.

If $(c_1, v_2)$ has a multivariate normal distribution, then the representation in equation (18.39) under the stated assumptions always holds. We could also extend equation (18.39) by putting other functions of $v_2$ on the right-hand side, such as squares and cross products, but we do not show these explicitly. Note that $y_2$ is exogenous if and only if $\rho_1 = 0$.

Under the maintained assumptions, we have

$$E(y_1 \mid z, y_2, v_2) = \exp(z_1\delta_1 + y_2\gamma_1 + v_2\rho_1), \tag{18.40}$$

and this equation suggests a strategy for consistently estimating $\delta_1$, $\gamma_1$, and $\rho_1$. If $v_2$ were observed, we could simply use this regression function in one of the earlier QMLE methods (for example, Poisson, two-step negative binomial, or gamma). Because these methods consistently estimate correctly specified conditional means, we can immediately conclude that the QMLEs would be consistent. (If $y_1$ conditional on $(z, y_2, c_1)$ has a Poisson distribution with mean in equation (18.37), then the distribution of $y_1$ given $(z, y_2, v_2)$ has overdispersion of the type (18.28), so the two-step negative binomial estimator might be preferred in this context.)

To operationalize this procedure, the unknown quantities $v_2$ must be replaced with estimates. Let $\hat{\Pi}_2$ be the $L \times G_1$ matrix of OLS estimates from the first-stage estimation of equation (18.38); these are consistent estimates of $\Pi_2$. Define $\hat{v}_2 = y_2 - z\hat{\Pi}_2$ (where the observation subscript is suppressed). Then estimate the exponential regression model using regressors $(z_1, y_2, \hat{v}_2)$ by one of the QMLEs. The estimates $(\hat{\delta}_1, \hat{\gamma}_1, \hat{\rho}_1)$ from this procedure are consistent using standard arguments from two-step estimation in Chapter 12.
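
The two steps can be sketched in a few dozen lines of plain Python. This is a minimal illustration under stated assumptions, not a production routine: it handles a single endogenous variable ($G_1 = 1$), uses the Poisson QMLE in the second step, computes it with Newton's method plus step halving (a numerical choice of ours), and reports point estimates only, with no adjustment of standard errors for the first stage. All function names are hypothetical.

```python
import math


def solve_linear(a, b):
    """Solve the square system a x = b by Gaussian elimination with pivoting."""
    n = len(b)
    aug = [list(row) + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        tail = sum(aug[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (aug[r][n] - tail) / aug[r][r]
    return x


def dot(row, coef):
    """Inner product of one data row with a coefficient vector."""
    return sum(r * c for r, c in zip(row, coef))


def ols(x_rows, y):
    """OLS coefficients from regressing y on the columns of x_rows."""
    k = len(x_rows[0])
    xtx = [[sum(r[i] * r[j] for r in x_rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * yi for r, yi in zip(x_rows, y)) for i in range(k)]
    return solve_linear(xtx, xty)


def poisson_qmle(x_rows, y, tol=1e-10, max_iter=200):
    """Poisson QMLE with mean exp(x b): Newton steps with step halving."""
    k = len(x_rows[0])

    def objective(b):
        return sum(yi * dot(r, b) - math.exp(dot(r, b)) for r, yi in zip(x_rows, y))

    beta = [0.0] * k
    for _ in range(max_iter):
        m = [math.exp(dot(r, beta)) for r in x_rows]
        score = [sum(r[j] * (yi - mi) for r, yi, mi in zip(x_rows, y, m))
                 for j in range(k)]
        hess = [[sum(r[i] * r[j] * mi for r, mi in zip(x_rows, m)) for j in range(k)]
                for i in range(k)]
        step = solve_linear(hess, score)
        current, scale = objective(beta), 1.0
        while scale > 1e-8:  # halve the step until the objective does not fall
            trial = [b + scale * s for b, s in zip(beta, step)]
            if objective(trial) >= current:
                break
            scale /= 2.0
        beta = trial
        if max(abs(scale * s) for s in step) < tol:
            break
    return beta


def control_function_two_step(z_rows, z1_rows, y2, y1):
    """Two-step control function estimates with a scalar endogenous y2."""
    pi2_hat = ols(z_rows, y2)  # first stage, equation (18.38)
    v2_hat = [y - dot(z, pi2_hat) for z, y in zip(z_rows, y2)]
    regressors = [list(z1) + [y, v] for z1, y, v in zip(z1_rows, y2, v2_hat)]
    coef = poisson_qmle(regressors, y1)  # ordered as (delta1, gamma1, rho1)
    return pi2_hat, v2_hat, coef
```

The last element of `coef` is the coefficient on the first-stage residual, so it is the quantity whose significance bears on the endogeneity of $y_2$. In practice one would rely on a packaged Poisson routine together with adjusted or bootstrapped standard errors.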

This method is similar in spirit to the methods we saw for binary response (Chapter 15) and Tobit regression models (Chapter 17). There is one difference: here, we do not need to make distributional assumptions about $y_1$ or $y_2$. However, we do assume that the reduced-form errors $v_2$ are independent of $z$. In addition, we assume that $c_1$ and $v_2$ are linearly related with $e_1$ in equation (18.39) independent of $v_2$.

Because $\hat{v}_2$ depends on $\hat{\Pi}_2$, the variance matrix estimators for $\hat{\delta}_1$, $\hat{\gamma}_1$, and $\hat{\rho}_1$ should generally be adjusted to account for this dependence, as described in Sections 12.5.2 and 14.1. The bootstrap is also attractive because the two-step procedure takes little computational time (unless the sample size is very large). Using the results from Section 12.5.2, it can be shown that estimation of $\Pi_2$ does not affect the asymptotic variance of the QMLEs when $\rho_1 = 0$, just as we saw when testing for endogeneity in probit and Tobit models. Therefore, testing for endogeneity of $y_2$ is relatively straightforward: simply test $H_0: \rho_1 = 0$ using a Wald or LM statistic. When $G_1 = 1$, the most convenient statistic is probably the $t$ statistic on $\hat{v}_2$, with the fully robust form being the most preferred (but the GLM form is also useful). The LM test for omitted variables is convenient when $G_1 > 1$ because it can be computed after estimating the null model ($\rho_1 = 0$) and then doing a variable addition test for $\hat{v}_2$. The test has $G_1$ degrees of freedom in the chi-square distribution.

There is a final comment worth making about this test. The null hypothesis is the same as $E(y_1 \mid z, y_2) = \exp(z_1\delta_1 + y_2\gamma_1)$. The test for endogeneity of $y_2$ simply looks for whether a particular linear combination of $y_2$ and $z$ appears in this conditional expectation. For the purposes of getting a limiting chi-square distribution, it does not matter where the linear combination $\hat{v}_2$ comes from. In other words, under the null hypothesis none of the assumptions we made about $(c_1, v_2)$ need to hold: $v_2$ need not be independent of $z$, and $e_1$ in equation (18.39) need not be independent of $v_2$. Therefore, as a test, this procedure is very robust, and it can be applied when $y_2$ contains binary, count, or other discrete variables. Unfortunately, if $y_2$ is endogenous, the correction does not work without something like the assumptions made previously.


Example 18.2 (Is Education Endogenous in the Fertility Equation?): We test for endogeneity of educ in Example 18.1. The IV for educ is a binary indicator for whether the woman was born in the first half of the year (frsthalf), which we assume is exogenous in the fertility equation. In the reduced-form equation for educ, the coefficient on frsthalf is −.636 (se = .104), and so there is a significant negative partial relationship between years of schooling and being born in the first half of the year.

When we add the first-stage residuals, $\hat{v}_2$, to the Poisson regression, its coefficient is .025, and its GLM standard error is .028. Therefore, there is little evidence against the null hypothesis that educ is exogenous. The coefficient on educ actually becomes larger in magnitude (−.046), but it is much less precisely estimated.


It is straightforward to allow general nonlinear functions of $(z_1, y_2)$ in the exponential model because including the control function $v_2$ accounts for endogeneity quite generally. Further, it might be that $y_2$, as it appears in equation (18.37), does not have a linear representation as in (18.38), with $v_2$ having the requisite properties. If the elements of $y_2$ are continuous, we can often find a strictly monotonic transformation such that a linear reduced form, with an additive error independent of $z$, is sensible. So, in the scalar case, if $0 < y_2 < 1$, we might use $\log[y_2/(1 - y_2)] = z\pi_2 + v_2$ as the reduced form, even though $y_2$ itself appears in the structural equation. Because we can write $y_2 = \exp(z\pi_2 + v_2)/[1 + \exp(z\pi_2 + v_2)]$, adding $\hat{v}_2$ to a QMLE analysis controls for the endogeneity of $y_2$.

In addition to the parameters, one might want the partial effect on the mean, rather than just an elasticity or semielasticity. We can estimate the average partial effects by using the general approach for probit, ordered probit, and Tobit models. The estimated average structural function is

$$\widehat{\mathrm{ASF}}(x_1) = N^{-1}\sum_{i=1}^N \exp(x_1\hat{\beta}_1 + \hat{v}_{i2}\hat{\rho}_1), \tag{18.41}$$

and then we take derivatives or changes with respect to the elements in $x_1 = g_1(z_1, y_2)$ for a known function $g_1(\cdot)$. The bootstrap can be applied to obtain valid standard errors and confidence intervals.
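
Equation (18.41) averages over the first-stage residuals while holding the index $x_1\hat{\beta}_1$ fixed. A minimal Python sketch for a scalar control function follows; the function names are hypothetical, and the estimates are assumed to come from a two-step procedure carried out beforehand.

```python
import math


def asf_hat(x1_beta_hat, rho1_hat, v2_hat):
    """Estimated average structural function (18.41) at a fixed index value."""
    total = sum(math.exp(x1_beta_hat + rho1_hat * v) for v in v2_hat)
    return total / len(v2_hat)


def asf_change(index_new, index_old, rho1_hat, v2_hat):
    """Change in the estimated ASF when the index moves between two values."""
    return asf_hat(index_new, rho1_hat, v2_hat) - asf_hat(index_old, rho1_hat, v2_hat)
```

Because the exponential function factors, the estimated ASF is $\exp(x_1\hat{\beta}_1)$ times the sample average of $\exp(\hat{v}_{i2}\hat{\rho}_1)$, so the scale factor needs to be computed only once.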

Recent theoretical work on linear models (for example, Bekker (1994)) as well as simulations (for example, Flores-Lagunes (2007)) suggests a single-step estimation method might have better finite-sample properties than two-step estimation, particularly with weak instruments or many overidentifying restrictions. It is easy to obtain a one-step version of the control function method just described. Suppose we decide to use the Poisson QMLE for the structural part of the model, and assume, for simplicity, that we have a single endogenous explanatory variable and that the linear reduced form is stated in terms of $y_2$ (the extension to $h_2(y_2)$ is straightforward). Then, we can estimate both sets of parameters simultaneously by solving

$$\min_{\theta}\sum_{i=1}^N \{\exp(x_{i1}\beta_1 + \rho_1(y_{i2} - z_i\pi_2)) - y_{i1}[x_{i1}\beta_1 + \rho_1(y_{i2} - z_i\pi_2)] + (y_{i2} - z_i\pi_2)^2/\tau_2^2 + \log(\tau_2^2)\},$$


where $\theta$ is the full set of parameters, including the reduced form variance parameter $\tau_2^2$, and $x_{i1}$ is whatever function of $(z_{i1}, y_{i2})$ appears in the model. (If we assume that $y_2 \mid z \sim \text{Normal}(z\pi_2, \tau_2^2)$, and $y_1 \mid y_2, z$ follows a Poisson distribution, then the minimization problem is the same as the limited information maximum likelihood (LIML) estimator. The “limited information” identifier comes from the fact that we use a reduced form for $y_2$.) This is a standard M-estimation problem, and its consistency holds provided the true values of the parameters solve the corresponding population problem. To this end, we label the population parameters by using the “o” subscript. Then we know $\pi_{o2}$ minimizes $E[(y_2 - z\pi_2)^2/\tau_2^2]$ for any $\tau_2^2 > 0$ (even if $\pi_{o2}$ simply indexes the linear projection), and then the “true” value $\tau_{o2}^2$ is the variance of $y_2 - z\pi_{o2}$. Further, because we assume $E(y_1 \mid z, y_2) = \exp(x_1\beta_{o1} + \rho_{o1}(y_2 - z\pi_{o2}))$, it follows that $(\beta_{o1}, \rho_{o1}, \pi_{o2})$ minimize

$$E\{\exp(x_1\beta_1 + \rho_1(y_2 - z\pi_2)) - y_1[x_1\beta_1 + \rho_1(y_2 - z\pi_2)]\}. \tag{18.42}$$

Therefore, $(\beta_{o1}, \rho_{o1}, \pi_{o2}, \tau_{o2}^2)$ minimizes the sum of the expected values. After M-estimation, we would use a fully robust sandwich estimator for the asymptotic variance because the Poisson distribution for $D(y_1 \mid z, y_2)$ is almost certainly wrong. Plus, we may not wish to think $v_2$ is independent of $z$ in $y_2 = z\pi_{o2} + v_2$ (even though obtaining the form of $E(y_1 \mid z, y_2)$ uses something like it). If we write $q_{i1}(\theta) = \exp(x_{i1}\beta_1 + \rho_1(y_{i2} - z_i\pi_2)) - y_{i1}[x_{i1}\beta_1 + \rho_1(y_{i2} - z_i\pi_2)]$ and $q_{i2}(\theta) = (y_{i2} - z_i\pi_2)^2/\tau_2^2 + \log(\tau_2^2)$, then the scores evaluated at the true parameters are uncorrelated because $E[\nabla_\theta q_{i1}(\theta_o) \mid y_{i2}, z_i] = 0$ and $\nabla_\theta q_{i2}(\theta_o)$ depends only on $(y_{i2}, z_i)$.
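
The one-step objective is the sum of a Poisson quasi-likelihood piece and a reduced-form piece, and it can be written down directly. The Python sketch below only evaluates the sample objective for given parameter values; the function name and data layout are assumptions made for illustration, and a numerical optimizer would still be needed to minimize it.

```python
import math


def one_step_objective(beta1, rho1, pi2, tau2_sq, data):
    """Sample objective: Poisson quasi-likelihood part plus reduced-form part.

    data is a list of (x1, z, y1, y2) tuples with x1 and z given as lists.
    """
    total = 0.0
    for x1, z, y1, y2 in data:
        v2 = y2 - sum(zj * pj for zj, pj in zip(z, pi2))  # reduced-form error
        index = sum(xj * bj for xj, bj in zip(x1, beta1)) + rho1 * v2
        q1 = math.exp(index) - y1 * index                 # structural part
        q2 = v2 ** 2 / tau2_sq + math.log(tau2_sq)        # reduced-form part
        total += q1 + q2
    return total
```

Writing the two pieces separately mirrors the argument in the text: the structural part identifies the mean parameters and the reduced-form part identifies the first-stage coefficients and variance.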

The assumption that $e_1$ in (18.39) is independent of $v_2$ and $z$ rules out most cases where $y_2$ has some discreteness, such as with binary, count, or corner responses. (In particular, the independence assumption likewise fails unless $v_2$ is independent of $z$, and it would be very unusual for a discrete variable to be expressible as a linear equation with additive error independent of $z$.) Terza (1998) considered the case where $y_2$ is a binary endogenous explanatory variable and follows a reduced form probit:

$$y_2 = 1[z\pi_2 + v_2 > 0], \qquad v_2 \mid z \sim \text{Normal}(0, 1). \tag{18.43}$$


As shown by Terza (1998), a control function (CF) method can be applied when $(c_1, v_2)$ has a joint normal distribution and is independent of $z$. To implement a CF approach, we need to find $E(y_1 \mid z, y_2) = \exp(x_1\beta_1)E[\exp(c_1) \mid z, y_2]$, where $x_1$ is a function of $(z_1, y_2)$, which would almost certainly include $y_2$ linearly and possibly interacted with elements of $z_1$. Now, suppose $(c_1, v_2)$ is independent of $z$ with mean zero and jointly normal. Let $\tau_1^2 = \mathrm{Var}(c_1)$ and $\rho_1 = \mathrm{Cov}(v_2, c_1)$, so that $c_1 = \rho_1 v_2 + e_1$, where $e_1 \mid z, v_2 \sim \text{Normal}(0, \tau_1^2 - \rho_1^2)$. Then

$$\begin{aligned} E(y_1 \mid z, v_2) &= E[\exp(e_1)]\exp(x_1\beta_1 + \rho_1 v_2) \\ &= \exp((\tau_1^2 - \rho_1^2)/2)\exp(x_1\beta_1 + \rho_1 v_2) \\ &= \exp((\tau_1^2 - \rho_1^2)/2 + x_1\beta_1)\exp(\rho_1 v_2). \end{aligned} \tag{18.44}$$


To find $E(y_1 \mid z, y_2)$, we have

$$E(y_1 \mid z, y_2) = \exp((\tau_1^2 - \rho_1^2)/2)\exp(x_1\beta_1)E[\exp(\rho_1 v_2) \mid z, y_2]. \tag{18.45}$$

As shown by Terza (1998), and as is used in the two-part model in Section 17.6.3,

$$\begin{aligned} E[\exp(\rho_1 v_2) \mid z, y_2 = 1] &= E[\exp(\rho_1 v_2) \mid z, v_2 > -z\pi_2] \\ &= \exp(\rho_1^2/2)\Phi(\rho_1 + z\pi_2)/\Phi(z\pi_2). \end{aligned} \tag{18.46}$$

Similarly,

$$E[\exp(\rho_1 v_2) \mid z, y_2 = 0] = \exp(\rho_1^2/2)[1 - \Phi(\rho_1 + z\pi_2)]/[1 - \Phi(z\pi_2)]. \tag{18.47}$$

It follows from (18.45), (18.46), and (18.47) that

$$E(y_1 \mid z, y_2) = \exp(\tau_1^2/2 + x_1\beta_1)\{\Phi(\rho_1 + z\pi_2)/\Phi(z\pi_2)\}^{y_2} \cdot \{[1 - \Phi(\rho_1 + z\pi_2)]/[1 - \Phi(z\pi_2)]\}^{1 - y_2}. \tag{18.48}$$

Notice that, if $x_1$ contains unity, as it should, then only $\tau_1^2/2 + \beta_{11}$ is identified, along with the other elements of $\beta_1$, $\rho_1$, and $\pi_2$. This is just fine because the average structural function is $\mathrm{ASF}(z_1, y_2) = E_{c_1}[\exp(x_1\beta_1 + c_1)] = \exp(\tau_1^2/2 + x_1\beta_1)$, and so the intercept that is identified is exactly what we want for computing APEs. Thus, in what follows we just absorb $\tau_1^2/2$ into the intercept.
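
The mean function (18.48) is straightforward to evaluate once the standard normal cdf is available. The Python sketch below, with hypothetical function names, takes the index $x_1\beta_1$ (with $\tau_1^2/2$ absorbed into the intercept), $\rho_1$, and $z\pi_2$ as given numbers and builds $\Phi(\cdot)$ from the error function.

```python
import math


def norm_cdf(a):
    """Standard normal cdf, computed from the error function."""
    return 0.5 * (1.0 + math.erf(a / math.sqrt(2.0)))


def terza_mean(x1_beta, rho1, z_pi2, y2):
    """Mean function (18.48) with tau_1^2 / 2 absorbed into the intercept."""
    if y2 == 1:
        ratio = norm_cdf(rho1 + z_pi2) / norm_cdf(z_pi2)
    else:
        ratio = (1.0 - norm_cdf(rho1 + z_pi2)) / (1.0 - norm_cdf(z_pi2))
    return math.exp(x1_beta) * ratio
```

Two sanity checks follow from the formula itself. When $\rho_1 = 0$ the correction factor is one, and, for a fixed index, weighting the two branches by $\Phi(z\pi_2)$ and $1 - \Phi(z\pi_2)$ returns $\exp(x_1\beta_1)$.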

In the first step of Terza’s two-step method we estimate the probit model of $y_2$ on $z$ to obtain the MLE, $\hat{\pi}_2$. In the second step we estimate the mean function in (18.48) with $\hat{\pi}_2$ in place of $\pi_2$. We can use NLS or a quasi-MLE, such as the Poisson or gamma QMLE. Either way, our inference should account for the two-step estimation, using either the delta method or bootstrap.

A simple test of $H_0: \rho_1 = 0$ is available. The derivative of the mean function with respect to $\rho_1$, evaluated at $\rho_1 = 0$, is $\exp(x_1\beta_1)[\lambda(z\pi_2)]^{y_2}[-\lambda(-z\pi_2)]^{1 - y_2}$, where $\lambda(\cdot)$ is the inverse Mills ratio. (We use the fact that $\phi(a)/[1 - \Phi(a)] = \phi(-a)/\Phi(-a)$.) Because $y_2$ is binary, this derivative equals $\exp(x_1\beta_1)[y_2\lambda(z\pi_2) - (1 - y_2)\lambda(-z\pi_2)]$. Therefore, a simple variable addition test of $\rho_1 = 0$ can be obtained by adding the variable $y_2\lambda(z\hat{\pi}_2) - (1 - y_2)\lambda(-z\hat{\pi}_2)$ to the exponential model $\exp(x_1\beta_1)$. That is, for each $i$ define $\hat{r}_{i2} = y_{i2}\lambda(z_i\hat{\pi}_2) - (1 - y_{i2})\lambda(-z_i\hat{\pi}_2)$, and then use a QMLE or NLS to estimate the artificial mean function $\exp(x_{i1}\beta_1 + \rho_1\hat{r}_{i2})$, and use a robust $t$ statistic for $\hat{\rho}_1$ (which is not the estimate we obtain from Terza’s two-step method).
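
The derivative that drives this test can be verified numerically. The Python sketch below is a check, not part of the testing procedure itself: it compares a central finite difference of the log of the correction factor in (18.48), taken with respect to $\rho_1$ at zero, with the quantity $y_2\lambda(z\pi_2) - (1 - y_2)\lambda(-z\pi_2)$. The function names, including `generalized_residual`, are our own labels.

```python
import math


def norm_pdf(a):
    """Standard normal density."""
    return math.exp(-0.5 * a * a) / math.sqrt(2.0 * math.pi)


def norm_cdf(a):
    """Standard normal cdf, computed from the error function."""
    return 0.5 * (1.0 + math.erf(a / math.sqrt(2.0)))


def inverse_mills(a):
    """Inverse Mills ratio, lambda(a) = phi(a) / Phi(a)."""
    return norm_pdf(a) / norm_cdf(a)


def generalized_residual(y2, z_pi2):
    """y2 * lambda(z pi2) - (1 - y2) * lambda(-z pi2) for binary y2."""
    return y2 * inverse_mills(z_pi2) - (1 - y2) * inverse_mills(-z_pi2)


def log_correction(rho1, z_pi2, y2):
    """Log of the factor multiplying exp(x1 beta1) in (18.48)."""
    if y2 == 1:
        return math.log(norm_cdf(rho1 + z_pi2) / norm_cdf(z_pi2))
    return math.log((1.0 - norm_cdf(rho1 + z_pi2)) / (1.0 - norm_cdf(z_pi2)))


def numerical_derivative_at_zero(z_pi2, y2, h=1e-5):
    """Central difference of log_correction with respect to rho1 at rho1 = 0."""
    return (log_correction(h, z_pi2, y2) - log_correction(-h, z_pi2, y2)) / (2.0 * h)
```

The finite difference agrees with `generalized_residual` for both values of $y_2$, which is the sense in which the added regressor captures the first-order effect of $\rho_1$ on the mean.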

Mullahy (1997) has shown how to estimate exponential models when some explanatory variables are endogenous without making assumptions about the reduced form of $y_2$. This approach is especially attractive for dummy endogenous and other discrete explanatory variables, where the linearity in equation (18.39), coupled with independence of $z$ and $v_2$, is unrealistic. To sketch Mullahy’s approach, write $x_1 = (z_1, y_2)$ and $\beta_1 = (\delta_1', \gamma_1')'$. Then, under the model (18.37), we can write

$$y_1\exp(-x_1\beta_1) = \exp(c_1)a_1, \qquad E(a_1 \mid z, y_2, c_1) = 1. \tag{18.49}$$

If we assume that $c_1$ is independent of $z$ (a standard assumption concerning unobserved heterogeneity and exogenous variables) and use the normalization $E[\exp(c_1)] = 1$, we have the conditional moment restriction

$$E[y_1\exp(-x_1\beta_1) \mid z] = 1. \tag{18.50}$$

Because $y_1$, $x_1$, and $z$ are all observable, condition (18.50) can be used as the basis for generalized method of moments (GMM) estimation. The function $g(y_1, y_2, z_1; \beta_1) = y_1\exp(-x_1\beta_1) - 1$, which depends on observable data and the parameters, is uncorrelated with any function of $z$ (at the true value of $\beta_1$). GMM estimation can be used as in Section 14.2 once a vector of instrumental variables has been chosen.
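
The sample moment conditions implied by (18.50) are simple to form. The Python sketch below uses the elements of $z$ themselves as instruments, which is one possible choice and an assumption made for illustration; the function name and data layout are likewise hypothetical. A GMM routine would then choose $\beta_1$ to make a quadratic form in these moments as small as possible.

```python
import math


def mullahy_moments(beta1, data):
    """Sample analogue of E[z' (y1 * exp(-x1 beta1) - 1)], based on (18.50).

    data is a list of (z, x1, y1) tuples with z and x1 given as lists.
    """
    n_obs = len(data)
    n_inst = len(data[0][0])
    moments = [0.0] * n_inst
    for z, x1, y1 in data:
        index = sum(xj * bj for xj, bj in zip(x1, beta1))
        resid = y1 * math.exp(-index) - 1.0  # multiplicative residual minus one
        for j in range(n_inst):
            moments[j] += z[j] * resid / n_obs
    return moments
```

When $y_1$ equals $\exp(x_1\beta_1)$ exactly, every moment is zero at the true parameter and generally nonzero elsewhere.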

An important feature of Mullahy’s approach is that no assumptions, other than the standard rank condition for identification in nonlinear models, are made about the distribution of $y_2$ given $z$: we need not assume the existence of a linear reduced form for $y_2$ with errors independent of $z$. Mullahy’s procedure is computationally more difficult, and testing for endogeneity in his framework is harder than in the QMLE approach. Therefore, we might first use the two-step quasi-likelihood method proposed earlier for testing, and if endogeneity seems to be important, Mullahy’s GMM estimator can be implemented. See Mullahy (1997) for details and an empirical example.


18.6 Fractional Responses 


We now consider models and estimation methods when the response variable, $y_i$, takes values in the unit interval, $[0, 1]$. Because we have already thoroughly covered binary response models in Chapter 15, we are thinking of cases where $y_i$ is not a binary response.


18.6.1 Exogenous Explanatory Variables 


In Section 17.8 we covered the two-limit Tobit model, which can be applied to fractional response variables when the limits are zero and one. Having estimated the two-limit Tobit given a set of explanatory variables, $x_i$, we can estimate the conditional mean, $E(y_i \mid x_i)$, as well as various probabilities. But there are two drawbacks to using Tobit to model fractional responses. First, it does not apply unless there is a pileup at both zero and one. Fractional responses that have continuous distributions in $(0, 1)$ cannot follow a two-limit Tobit, nor can responses that have a mass point at zero or one but not both. Second, the two-limit Tobit imposes a parametric model on the density for $D(y_i \mid x_i)$. If we are interested primarily in the effects on the conditional mean, then the two-limit Tobit, even when it logically applies, will generally produce inconsistent estimates of $E(y_i \mid x_i)$ when the assumed distribution is misspecified. (How serious the problem might be has not been carefully investigated.)

One can always search for other distributions that are logically consistent with the 
nature of the response variable. For example, if $y_i$ is a continuous fractional response 
on (0,1), a conditional beta distribution is one possibility, and we can apply standard 
MLE methods for estimation. Kieschnick and McCullough (2003) suggest this 
approach. Like the two-limit Tobit approach, MLE using the beta distribution is 
inconsistent for all parameters if any aspect of the distribution is misspecified. 
Consequently, if one is primarily interested in the conditional mean, specifying a beta 
distribution is not robust. It not only rules out applications where $y_i$ has a mass point 
at zero or one, but it is inconsistent when $y_i$ is continuous on (0,1) but follows a 
distribution other than the beta distribution. 

When y is strictly between zero and one, an alternative approach is to assume the 
log-odds transformation of y, $\log[y/(1-y)]$, has a conditional expectation of the 
form $x\beta$. Then, a simple estimator of $\beta$ is the OLS estimator from the regression 
$w_i$ on $x_i$, where $w_i \equiv \log[y_i/(1-y_i)]$. While simple, the log-odds approach has a 
couple of drawbacks. First, it cannot be applied to corner solution responses unless 
we make some arbitrary adjustments. Because $\log[y/(1-y)] \to -\infty$ as $y \to 0$ 
and $\log[y/(1-y)] \to \infty$ as $y \to 1$, we might worry that our estimates are sensitive to 
the adjustment. Second, even if y is strictly in the unit interval, $\beta$ is difficult to 
interpret: without further assumptions, it is not possible to estimate $E(y \mid x)$ from a model 
for $E\{\log[y/(1-y)] \mid x\}$. See Papke and Wooldridge (1996) for further discussion. 

One possibility is to assume the log-odds transformation yields a linear model with 
an additive error independent of x, so 

$$\log[y/(1-y)] = x\beta + e, \qquad D(e \mid x) = D(e), \tag{18.51}$$

where we take $E(e) = 0$ (and assume that $x_1 = 1$). Then, we can write 

$$y = \exp(x\beta + e)/[1 + \exp(x\beta + e)]. \tag{18.52}$$

Now, if e and x are independent, then 

$$E(y \mid x) = \int \exp(x\beta + e)/[1 + \exp(x\beta + e)] \, dF(e), \tag{18.53}$$

where $F(\cdot)$ is the distribution function of e (and we use e as the dummy argument in 
the integration). As shown by Duan (1983) in a general retransformation context, for 
given x this expectation can be consistently estimated as 

$$\hat{E}(y \mid x) = N^{-1} \sum_{i=1}^{N} \exp(x\hat\beta + \hat e_i)/[1 + \exp(x\hat\beta + \hat e_i)], \tag{18.54}$$

where $\hat\beta$ is the OLS estimator from $w_i$ on $x_i$ and $\hat e_i = w_i - x_i\hat\beta$ are the OLS residuals 
from the log-odds regression. A similar analysis applies if we replace the log-odds 
transformation in (18.51) with $\Phi^{-1}(y)$, where $\Phi^{-1}(\cdot)$ is the inverse function of the 
standard normal cdf, in which case we average $\Phi(x\hat\beta + \hat e_i)$ across i to estimate 
$E(y \mid x)$. 
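
The smearing estimate in (18.54) can be sketched in a few lines. This is only an illustration under stated assumptions: the function name `smearing_mean`, the simulated design, and the parameter values are hypothetical and do not come from the text, and the response is assumed to lie strictly inside (0, 1).

```python
import numpy as np

def smearing_mean(x0, X, y):
    """Estimate E(y | x = x0) for 0 < y < 1: log-odds OLS plus Duan's smearing."""
    w = np.log(y / (1.0 - y))                          # log-odds transform w_i
    beta_hat, *_ = np.linalg.lstsq(X, w, rcond=None)   # OLS of w_i on x_i
    resid = w - X @ beta_hat                           # OLS residuals e_hat_i
    index = x0 @ beta_hat + resid                      # x0*beta_hat + e_hat_i for each i
    return float(np.mean(1.0 / (1.0 + np.exp(-index))))  # average the logistic function

# Hypothetical simulated data (assumption): logistic-normal response.
rng = np.random.default_rng(0)
N = 5000
X = np.column_stack([np.ones(N), rng.normal(size=N)])
e = rng.normal(scale=0.8, size=N)
y = 1.0 / (1.0 + np.exp(-(X @ np.array([0.2, 0.5]) + e)))

x0 = np.array([1.0, 1.0])
est = smearing_mean(x0, X, y)
print(f"smearing estimate of E(y | x0): {est:.4f}")
```

Averaging over the residuals, rather than plugging $\hat e = 0$ into the logistic function, is what distinguishes the smearing estimate from a naive retransformation.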

When it is applicable, Duan’s method imposes fewer assumptions than a fully 
parametric model for D(y |x). Nevertheless, it still is a roundabout way of estimating 
partial effects on E(y |x). An alternative is to directly specify models for E(y | x) that 
ensure predicted values are in (0, 1). For example, we can specify E(y |x) as a logistic 
function: 


$$E(y \mid x) = \exp(x\beta)/[1 + \exp(x\beta)], \tag{18.55}$$

or as a probit function, 

$$E(y \mid x) = \Phi(x\beta). \tag{18.56}$$

In each case the fitted values will be in (0,1) and, of course, each allows y to take on 
any values in [0, 1], including at the endpoints zero and one. Also, just as in logit and 
probit models for binary responses, the partial effects diminish as $x\beta \to \infty$. We can 
compute these partial effects just as in the binary response case, and compute APEs 
for continuous and discrete variables as in Chapter 15. The difference is that now 
these are partial effects on the expected value of a fractional response, not on a con- 
ditional probability of a binary response. As with the linear probability model for 
binary response, the APEs in the nonlinear models can be compared with coefficients 
from a linear regression of $y_i$ on $x_i$. 

Of course, nothing requires us to choose a model for E(y|x) that depends on a 
cumulative distribution function; any function bounded in (0,1) would do. But the 
formulations in (18.55) and (18.56) are convenient for estimation and interpretation. 
In the next subsection, we will see that (18.56) easily allows certain kinds of endoge- 
nous explanatory variables. 

Given that (18.55) and (18.56) specify a conditional mean function, say, $G(x\beta)$, 
one approach to estimation is NLS. NLS is consistent and inference is straightforward, 
provided we use the fully robust sandwich variance matrix estimator that does 
not restrict $\mathrm{Var}(y \mid x)$. Nevertheless, as in estimating models of conditional means for 
unbounded, nonnegative responses, NLS is unlikely to be efficient because common 
distributions for a fractional response imply heteroskedasticity; this is true of the 
two-limit Tobit and beta distributions, among others. Instead, we could specify a flexible 
variance function and use weighted NLS in a two-step procedure, as in Chapter 12. 
A simpler, one-step strategy is to use a quasi-likelihood approach. Papke and 
Wooldridge (1996) note that, because the Bernoulli log likelihood is in the linear 
exponential family, the results of GMT (1984a) can be applied to conclude that the 
QMLE that solves 

$$\max_{b} \sum_{i=1}^{N} \{(1 - y_i)\log[1 - G(x_i b)] + y_i \log[G(x_i b)]\} \tag{18.57}$$

is consistent whenever the conditional mean is correctly specified. (A careful 
statement of the result requires distinguishing a generic value of the parameter vector 
from the “true” value.) Notice that (18.57) is well defined for any $y_i$ in [0,1] and 
functions $0 < G(\cdot) < 1$. Plus, it is a standard estimation problem because it is identical 
to estimating binary response models. We call estimation of (18.55) by Bernoulli 
QMLE fractional logit regression and (18.56) estimated by Bernoulli QMLE 
fractional probit regression. 
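
As a rough illustration of fractional logit regression, the sketch below maximizes (18.57) with $G$ logistic by Newton's method and reports sandwich standard errors that place no restriction on $\mathrm{Var}(y \mid x)$. The function name `fractional_logit` and the simulated beta-distributed response are assumptions made for the example, not part of the text.

```python
import numpy as np

def fractional_logit(X, y, tol=1e-10, max_iter=50):
    """Bernoulli QMLE with logistic mean; returns (beta_hat, robust_se)."""
    beta = np.zeros(X.shape[1])
    for _ in range(max_iter):
        G = 1.0 / (1.0 + np.exp(-X @ beta))
        score = X.T @ (y - G)                          # gradient of the quasi-log likelihood
        hess = (X * (G * (1.0 - G))[:, None]).T @ X    # minus the Hessian
        step = np.linalg.solve(hess, score)
        beta = beta + step
        if np.max(np.abs(step)) < tol:
            break
    G = 1.0 / (1.0 + np.exp(-X @ beta))
    A = (X * (G * (1.0 - G))[:, None]).T @ X           # "bread"
    B = (X * ((y - G) ** 2)[:, None]).T @ X            # "meat": outer product of scores
    A_inv = np.linalg.inv(A)
    V = A_inv @ B @ A_inv                              # fully robust variance estimate
    return beta, np.sqrt(np.diag(V))

# Hypothetical data (assumption): beta-distributed y with a logistic conditional mean.
rng = np.random.default_rng(0)
N = 5000
X = np.column_stack([np.ones(N), rng.normal(size=N)])
beta_true = np.array([-0.5, 1.0])
mu = 1.0 / (1.0 + np.exp(-X @ beta_true))
y = rng.beta(5.0 * mu, 5.0 * (1.0 - mu))               # E(y | x) = mu

beta_hat, se = fractional_logit(X, y)
print("estimates:", beta_hat, "robust s.e.:", se)
```

The iteration never evaluates a logarithm of $y_i$, so observations at exactly zero or one cause no difficulty, in line with the remark that (18.57) is well defined for any $y_i$ in [0,1].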

There has been some confusion in the literature about the nature of the robustness 
of the Bernoulli QMLE from (18.57) and how it compares with other methods. For 
example, when 0 < y < 1, Kieschnick and McCullough (2003) recommend that 
researchers choose the fully parametric beta MLE over the Bernoulli QMLE “unless 
their sample size is large enough to justify the asymptotic arguments underlying 
the quasi-likelihood approach” (p. 193). This statement is highly misleading. The 
robustness of the Bernoulli QMLE arises because the population version of the 
objective function in (18.57) identifies the parameters in a correctly specified condi- 
tional mean, without any extra assumptions. By contrast, the beta log-likelihood 
function does not have this feature unless the full beta distribution is correctly speci- 
fied. The sample size is irrelevant for choosing between the two approaches because 
in each case the statistical properties are based on asymptotic analysis, and there is no 
evidence that the beta MLE is better behaved in small samples than the Bernoulli 
QMLE. Generally, it is important to remember that the notion of robustness for 
identifying parameters, with or without various misspecifications, is a population 
issue. 

As we know from Chapter 15, the variance associated with the log likelihood in 
(18.57) is $G(x_i\beta)[1 - G(x_i\beta)]$. Therefore, it is natural to specify the Bernoulli GLM 
variance assumption as 

$$\mathrm{Var}(y \mid x) = \sigma^2 E(y \mid x)[1 - E(y \mid x)]. \tag{18.58}$$

If $\sigma^2 = 1$, then we can actually use the usual estimated inverse information matrix for 
inference. However, if (18.58) holds, it is often with $\sigma^2 < 1$, and so, in this case, 
inference based on the usual binary response statistics will be too conservative, often 
much too conservative. If assumption (18.58) does hold, we estimate the asymptotic 
variance as in (18.37) with $\hat v_i = G(x_i\hat\beta)[1 - G(x_i\hat\beta)]$, $\nabla_\beta \hat m_i = g(x_i\hat\beta)x_i$ (where $g(\cdot)$ is 
the derivative of $G(\cdot)$), and 

$$\hat\sigma^2 = (N - K)^{-1} \sum_{i=1}^{N} \hat u_i^2/\hat v_i, \tag{18.59}$$

where K is the number of parameters and $\hat u_i = y_i - G(x_i\hat\beta)$. Not surprisingly, if 
(18.58) holds, the Bernoulli QMLE is asymptotically efficient among estimators that 
specify only $E(y \mid x)$, assuming, as always, that the mean is correctly specified. 
(Algebraic properties for the Bernoulli log likelihood obtained for the binary case apply in 
the current situation. In particular, for fractional logit, if $x_i$ includes a constant, as it 
almost always should, the residuals $\hat u_i$ sum to zero, while this is not true for fractional 
probit.) 

When does (18.58) hold for fractional responses? One case is when $y_i$ is a 
proportion, say, $y_i = s_i/n_i$, where $s_i$ is the number of “successes” in $n_i$ Bernoulli draws. 
Suppose that $s_i$ given $(n_i, x_i)$ follows a binomial distribution, as in Section 18.3.2 with 
$p(x, \beta) = G(x\beta)$. Then $E(y_i \mid n_i, x_i) = n_i^{-1}E(s_i \mid n_i, x_i) = G(x_i\beta)$ and $\mathrm{Var}(y_i \mid n_i, x_i) = 
n_i^{-2}\mathrm{Var}(s_i \mid n_i, x_i) = n_i^{-1}G(x_i\beta)[1 - G(x_i\beta)]$. Now, suppose that $n_i$ is independent of 
$x_i$. Then 

$$\begin{aligned}
\mathrm{Var}(y_i \mid x_i) &= \mathrm{Var}[E(y_i \mid n_i, x_i) \mid x_i] + E[\mathrm{Var}(y_i \mid n_i, x_i) \mid x_i] \\
&= 0 + E(n_i^{-1} \mid x_i)\,G(x_i\beta)[1 - G(x_i\beta)] \\
&\equiv \sigma^2 G(x_i\beta)[1 - G(x_i\beta)],
\end{aligned} \tag{18.60}$$

where $\sigma^2 \equiv E(n_i^{-1}) \le 1$ (with strict inequality unless $n_i = 1$ with probability one). 
Therefore, if $y_i$ is obtained as a proportion of Bernoulli successes, we can expect 
underdispersion in (18.58). Further, if we are given data on proportions but do not 
know $n_i$, it makes sense to use a fractional logit or probit analysis. If we observe the 
$n_i$, we might use a binomial regression model instead. 
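
The underdispersion result in (18.60) is easy to check numerically. The following sketch simulates proportions from binomial draws with group sizes that are independent of the covariate and compares the sample analogue of $\sigma^2$ with the sample mean of $1/n_i$. All numerical values are hypothetical choices for the illustration, and the true $G(x_i\beta)$ is used in place of an estimate to keep the example short.

```python
import numpy as np

rng = np.random.default_rng(1)
N = 200_000
x = rng.normal(size=N)
G = 1.0 / (1.0 + np.exp(-(0.3 + 0.6 * x)))   # assumed success probability G(x*beta)
n = rng.integers(2, 21, size=N)              # group sizes 2..20, drawn independently of x
s = rng.binomial(n, G)                       # number of successes in n_i draws
y = s / n                                    # observed proportion y_i = s_i / n_i

# Sample analogue of sigma^2 in (18.58), using the true mean function
sigma2_hat = np.mean((y - G) ** 2 / (G * (1.0 - G)))
# Sample analogue of E(1/n_i), the value implied by (18.60)
sigma2_theory = np.mean(1.0 / n)

print(f"sigma2_hat = {sigma2_hat:.4f}, mean(1/n) = {sigma2_theory:.4f}")
```

Both quantities should be well below one here, which is the sense in which the usual binary response standard errors would be too conservative.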

It is easy to think of cases where the GLM assumption (18.58) fails. For one thing, 
it is unlikely that $n_i$ and $x_i$ are independent in all applications. For example, in Papke 
and Wooldridge (1996), $n_i$ is the number of workers at firm i, $y_i$ is the fraction 
participating in a 401(k) pension plan, and $x_i$ includes firm characteristics. Continuing with 
this example, if the probability of participating in a pension plan depends on 
unobserved worker and firm characteristics, this within-firm correlation generally 
invalidates $\mathrm{Var}(s_i \mid n_i, x_i) = n_i G(x_i\beta)[1 - G(x_i\beta)]$, in which case (18.58) also fails. Therefore, 
Papke and Wooldridge (1996) recommend fully robust sandwich standard errors and 
test statistics, which are easy to compute using GLM routines in popular software 
packages. 

Variable addition tests for functional form are easily obtained. For example, after 
obtaining fractional regression estimates, we can add powers of $x_i\hat\beta$ (say, the square 
and cube) to a subsequent fractional regression and carry out a robust joint test. See 
Papke and Wooldridge (1996) for further discussion and an application to 401(k) 
plan participation rates, and also Problem 18.14. 


18.6.2 Endogenous Explanatory Variables 


The fractional probit model can easily handle certain kinds of continuous 
endogenous explanatory variables. As in Section 18.5, we consider a specification with an 
unobserved factor that we would like to condition on: 

$$E(y_1 \mid z, y_2, c_1) = E(y_1 \mid z_1, y_2, c_1) = \Phi(z_1\delta_1 + \gamma_1 y_2 + c_1) \tag{18.61}$$

$$y_2 = z\pi_2 + v_2 = z_1\pi_{21} + z_2\pi_{22} + v_2, \tag{18.62}$$

where $c_1$ is an omitted factor thought to be correlated with $y_2$ but independent of the 
exogenous variables z. Ideally, we could assume that the linear equation for 
$y_2$ simply represents a linear projection; unfortunately, we need to assume more, and 
here we effectively assume that $v_2$ is independent of z. More specifically, we assume 

$$c_1 = \rho_1 v_2 + e_1, \qquad e_1 \mid z, v_2 \sim \mathrm{Normal}(0, \sigma_{e1}^2), \tag{18.63}$$

where a sufficient, though not necessary, condition is that $(c_1, v_2)$ is bivariate normal 
and independent of z. Under (18.61), (18.62), and (18.63), we can show that 

$$E(y_1 \mid z, y_2) = E(y_1 \mid z, y_2, v_2) = \Phi(z_1\delta_{e1} + \gamma_{e1} y_2 + \rho_{e1} v_2), \tag{18.64}$$

where the “e” subscript denotes multiplication by the scale factor $1/(1 + \sigma_{e1}^2)^{1/2}$. 
Fortunately, as discussed in Wooldridge (2005a), equation (18.64) can be used as the 
basis for estimating APEs. The CF approach is now fairly clear. In the first step, 
obtain the OLS residuals $\hat v_{i2}$ from the regression $y_{i2}$ on $z_i$. Next, use fractional probit 
of $y_{i1}$ on $z_{i1}$, $y_{i2}$, $\hat v_{i2}$ to estimate the scaled coefficients. 

A simple test of the null hypothesis that $y_2$ is exogenous is the fully robust t 
statistic on $\hat v_{i2}$; as with other tests based on adding residuals, the first-step estimation can 
be ignored under the null. If $\rho_1 \ne 0$, then the robust sandwich variance matrix 
estimator of the scaled coefficients is not valid because it does not account for the 
first-step estimation. The formulas for two-step estimation from Chapter 12 can be used. 
Bootstrapping the two-step procedure is quite feasible because computational time 
for each sample is minimal. 
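
The two-step control function procedure can be sketched as follows. This is an illustration under assumptions, not the author's implementation: the helper names (`Phi`, `phi`, `fractional_probit`, `control_function_estimates`), the Fisher scoring iteration, and the simulated design are all hypothetical, and no standard errors are computed (the text recommends the two-step formulas or the bootstrap for that).

```python
import math
import numpy as np

_erf = np.vectorize(math.erf)

def Phi(a):
    """Standard normal cdf."""
    return 0.5 * (1.0 + _erf(np.asarray(a, dtype=float) / math.sqrt(2.0)))

def phi(a):
    """Standard normal pdf."""
    return np.exp(-0.5 * a ** 2) / math.sqrt(2.0 * math.pi)

def fractional_probit(X, y, max_iter=100, tol=1e-9):
    """Bernoulli QMLE with probit mean, computed by Fisher scoring."""
    beta = np.zeros(X.shape[1])
    for _ in range(max_iter):
        idx = X @ beta
        G = np.clip(Phi(idx), 1e-10, 1.0 - 1e-10)
        g = phi(idx)
        w = g / (G * (1.0 - G))
        score = X.T @ (w * (y - G))
        info = (X * (w * g)[:, None]).T @ X       # expected (minus) Hessian
        step = np.linalg.solve(info, score)
        beta = beta + step
        if np.max(np.abs(step)) < tol:
            break
    return beta

def control_function_estimates(y1, z1, y2, Z):
    """Step 1: OLS of y2 on Z. Step 2: fractional probit of y1 on (z1, y2, v2_hat)."""
    pi_hat, *_ = np.linalg.lstsq(Z, y2, rcond=None)
    v2_hat = y2 - Z @ pi_hat
    X = np.column_stack([z1, y2, v2_hat])
    return fractional_probit(X, y1), v2_hat

# Hypothetical design (assumption) satisfying (18.61)-(18.63)
rng = np.random.default_rng(2)
N = 5000
z11, z2, v2 = rng.normal(size=N), rng.normal(size=N), rng.normal(size=N)
e1 = rng.normal(scale=0.6, size=N)
y2 = 0.5 * z11 + 1.0 * z2 + v2
c1 = 0.5 * v2 + e1                                # rho_1 = 0.5, sigma_e1 = 0.6
y1 = Phi(0.1 + 0.3 * z11 + 0.5 * y2 + c1)         # gamma_1 = 0.5
z1 = np.column_stack([np.ones(N), z11])
Z = np.column_stack([z1, z2])

coef, v2_hat = control_function_estimates(y1, z1, y2, Z)
scale = 1.0 / math.sqrt(1.0 + 0.6 ** 2)
print("scaled estimates:", coef, " target for y2 coefficient:", 0.5 * scale)
```

In this design the second step recovers the scaled coefficients in (18.64), not the structural ones in (18.61), which is why the comparison value is multiplied by the scale factor.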

The average structural function is consistently estimated as 

$$\widehat{\mathrm{ASF}}(z_1, y_2) = N^{-1} \sum_{i=1}^{N} \Phi(z_1\hat\delta_{e1} + \hat\gamma_{e1} y_2 + \hat\rho_{e1}\hat v_{i2}), \tag{18.65}$$

and this can be used to obtain APEs with respect to $y_2$ or $z_1$. Bootstrapping the 
standard errors and test statistics is a sensible way to proceed with inference. 
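
Given scaled coefficient estimates and first-stage residuals, (18.65) is a simple average. The sketch below shows one way to compute it, together with a numerical derivative with respect to $y_2$; the function names and the illustrative numbers are hypothetical, and the coefficient values are placeholders, not estimates from any data set.

```python
import math
import numpy as np

def norm_cdf(a):
    """Standard normal cdf, vectorized with math.erf."""
    return 0.5 * (1.0 + np.vectorize(math.erf)(np.asarray(a, dtype=float) / math.sqrt(2.0)))

def asf_hat(z1, y2, delta_e1, gamma_e1, rho_e1, v2_hat):
    """Average Phi(z1*delta_e1 + gamma_e1*y2 + rho_e1*v2_hat_i) over the residuals v2_hat_i."""
    index = z1 @ delta_e1 + gamma_e1 * y2 + rho_e1 * v2_hat
    return float(np.mean(norm_cdf(index)))

def asf_slope_y2(z1, y2, delta_e1, gamma_e1, rho_e1, v2_hat, h=1e-5):
    """Central-difference derivative of the estimated ASF with respect to y2."""
    up = asf_hat(z1, y2 + h, delta_e1, gamma_e1, rho_e1, v2_hat)
    dn = asf_hat(z1, y2 - h, delta_e1, gamma_e1, rho_e1, v2_hat)
    return (up - dn) / (2.0 * h)

# Placeholder inputs (assumptions for illustration only)
z1 = np.array([1.0, 0.0])
delta_e1 = np.array([0.0, 1.0])
v2_hat = np.array([-1.0, 0.0, 1.0])

asf_no_endog = asf_hat(z1, 0.0, delta_e1, 0.5, 0.0, v2_hat)     # rho_e1 = 0
slope_no_endog = asf_slope_y2(z1, 0.0, delta_e1, 0.5, 0.0, v2_hat)
asf_endog = asf_hat(z1, 1.0, delta_e1, 0.5, 0.4, v2_hat)        # rho_e1 = 0.4
print(asf_no_endog, slope_no_endog, asf_endog)
```

Averaging such derivatives (or discrete differences) over the sample values of $(z_{i1}, y_{i2})$ would give an APE; the bootstrap mentioned in the text would wrap both estimation steps and this calculation.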

As discussed in Wooldridge (2005c), the basic model can be extended in many 
ways. For example, if the $y_2$ we want to appear in (18.65) might not naturally have 
a linear reduced form with an independent error, we might use a strictly monotonic 
transformation of it in (18.62): that is, replace $y_2$ with $h_2(y_2)$. If $y_2 > 0$ then $h_2(y_2) = 
\log(y_2)$ is natural; if $0 < y_2 < 1$, we might use the log-odds transformation in (18.62), 
$h_2(y_2) = \log[y_2/(1 - y_2)]$. Unfortunately, if $y_2$ has a mass point (such as a binary 
response, or corner response, or count variable), a transformation yielding an 
additive, independent error probably does not exist. 

More generally, we can let $x_1 = k_1(z_1, y_2)$ for a vector of functions $k_1(\cdot,\cdot)$, and 
allow a set of reduced forms for strictly monotonic functions $h_{2g}(y_{2g})$, $g = 1,\ldots,G_1$, 
where $G_1$ is the dimension of $y_2$. Further, if we are willing to assume that $D(c_1 \mid z, v_2)$ 
is independent of z with mean a polynomial in $v_2$, where $v_2$ is a scalar for simplicity, 
then we are justified in adding nonlinear functions of $\hat v_{i2}$ to the fractional probit. For 
example, we could use the control functions $\hat v_{i2}$ and $(\hat v_{i2}^2 - \hat\tau_2^2)$, where $\hat\tau_2^2$ is the usual 
estimated error variance from the first-stage regression. See Wooldridge (2005c) for 
further discussion and extensions. 

The previous method relies on $y_2$ being continuous because we require that a 
strictly monotonic transformation of $y_2$ can be written in a linear fashion with 
additive error independent of z. We can handle a binary $y_2$ rather easily if we 
maintain that $D(y_2 \mid z)$ follows a probit. In fact, we can maximize the same log 
likelihood as we derived in Section 15.7.3, even though $y_1$ is a fractional response. 
To see why this works, first note that the average structural function in (18.61) is 
$\Phi(x_1\beta_1/(1 + \sigma_c^2)^{1/2})$, and so we hope to estimate $\beta_{e1} \equiv \beta_1/(1 + \sigma_c^2)^{1/2}$ (and cannot 
estimate $\beta_1$ and $\sigma_c^2$ separately, anyway; see also Section 15.7.1). Next, note that we 
can obtain $E(y_1 \mid z, y_2, c_1)$ as $E(y_1 \mid z, y_2, c_1) = E\{1[x_1\beta_1 + c_1 + r_1 \ge 0] \mid z, y_2, c_1\}$, 
where $D(r_1 \mid z, y_2, c_1) = \mathrm{Normal}(0, 1)$. By iterated expectations, 

$$\begin{aligned}
E(y_1 \mid z, y_2) &= E\{1[x_1\beta_1 + c_1 + r_1 \ge 0] \mid z, y_2\} \\
&= E\{1[x_1\beta_{e1} + e_1 \ge 0] \mid z, y_2\},
\end{aligned} \tag{18.66}$$

where $e_1 \equiv (c_1 + r_1)/(1 + \sigma_c^2)^{1/2}$ is independent of z with a standard normal 
distribution. If 

$$y_2 = 1[z\pi_2 + v_2 > 0], \tag{18.67}$$

and we assume $(c_1, v_2)$ is independent of z with a zero mean bivariate normal 
distribution, then $(e_1, v_2)$ is independent of z with a bivariate normal distribution where 
each is standard normal. Let $\rho_1 = \mathrm{Corr}(v_2, e_1)$. It follows from (18.66) and (18.67) 
that $E(y_1 \mid z, y_2)$ has exactly the form of $P(w_1 = 1 \mid z, y_2)$, where $w_1 = 1[x_1\beta_{e1} + e_1 \ge 0]$. 
In other words, even though $y_1$ is not binary, its expected value given $(z, y_2)$ is the 
same as the response probability implied by the bivariate probit model from Section 
15.7.3. Because the Bernoulli log likelihood is in the linear exponential family, it 
identifies the correctly specified conditional mean. Further, we are assuming that 
$D(y_2 \mid z)$ follows a probit. To be technically precise, if we add “o” subscripts to 
denote the true population values, $\pi_{o2}$ maximizes $E[\log f(y_{i2} \mid z_i; \pi_2)]$ and $(\beta_{oe1}, \rho_{o1}, \pi_{o2})$ 
maximizes $E[\log f(y_{i1} \mid y_{i2}, z_i; \beta_{e1}, \rho_1, \pi_2)]$. It follows that the true parameters 
maximize 

$$E[\log f(y_{i1} \mid y_{i2}, z_i; \beta_{e1}, \rho_1, \pi_2)] + E[\log f(y_{i2} \mid z_i; \pi_2)],$$

and so the quasi-MLE using the usual bivariate probit log likelihood is consistent 
and asymptotically normal. The scores for the two quasi-log likelihoods are still 
uncorrelated, but the information matrix equality would not hold for the first part 
of the quasi-log likelihood because $f(y_{i1} \mid y_{i2}, z_i; \beta_{e1}, \rho_1, \pi_2)$ is not a true density. 
An M-estimator sandwich covariance matrix estimator (see equations (12.47) and 
(12.48)) is required and is straightforward to compute in this setting. Naturally, the 
bootstrap can be used, too. 


18.7 Panel Data Methods 


In this final section, we discuss estimation of panel data models, primarily focusing 
on count data. Our main interest is in models that contain unobserved effects, but 
we initially cover pooled estimation when the model does not explicitly contain an 
unobserved effect. 

The pioneering work in unobserved effects count data models was done by 
Hausman, Hall, and Griliches (1984) (HHG), who were interested in explaining patent 
applications by firms in terms of spending on research and development. HHG 
developed random and fixed effects (FE) models under full distributional assumptions. 
Wooldridge (1999a) has shown that one of the approaches suggested by HHG, which 
is typically called the fixed effects Poisson model, has some nice robustness properties. 
We will study those here. 

Other count panel data applications include (with response variable in parentheses) 
Rose (1990) (number of airline accidents), Papke (1991) (number of firm births in 
an industry), Downes and Greenstein (1996) (number of private schools in a public 
school district), and Page (1995) (number of housing units shown to individuals). The 
time series dimension in each of these studies allows us to control for unobserved het- 
erogeneity in the cross section units, and to estimate certain dynamic relationships. 

As with the rest of the book, we explicitly consider the case with N large relative to 
T, as the asymptotics hold with T fixed and $N \to \infty$. 


18.7.1 Pooled QMLE 


We begin by discussing pooled estimation after specifying a model for a conditional 
mean. Let $\{(x_t, y_t): t = 1,2,\ldots,T\}$ denote the time series observations for a random 
draw from the cross section population. We assume that, for some $\beta_o \in \mathcal{B}$, 

$$E(y_t \mid x_t) = m(x_t, \beta_o), \qquad t = 1,2,\ldots,T. \tag{18.68}$$

This assumption simply means that we have a correctly specified parametric model 
for $E(y_t \mid x_t)$. For notational convenience only, we assume that the function m itself 
does not change over time. Relaxing this assumption just requires a notational 
change, or we can include time dummies in $x_t$. For $y_t \ge 0$ and unbounded from 
above, the most common conditional mean is $\exp(x_t\beta)$. There is no restriction on the 
time dependence of the observations under assumption (18.68), and $x_t$ can contain 
any observed variables. For example, a static model has $x_t = z_t$, where $z_t$ is dated 
contemporaneously with $y_t$. A finite distributed lag has $x_t$ containing lags of $z_t$. Strict 
exogeneity of $(x_1,\ldots,x_T)$, that is, $E(y_t \mid x_1,\ldots,x_T) = E(y_t \mid x_t)$, is not assumed. In 
particular, $x_t$ can contain lagged dependent variables, although how these might 
appear in nonlinear models is not obvious (see Wooldridge (1997c) for some 
possibilities). A limitation of model (18.68) is that it does not explicitly incorporate an 
unobserved effect. 

For each i = 1,2,...,N, $\{(x_{it}, y_{it}): t = 1,2,\ldots,T\}$ denotes the time series 
observations for cross section unit i. We assume random sampling from the cross section. 

One approach to estimating $\beta_o$ is pooled NLS, which was introduced in Section 
12.9. When y is a count variable, a Poisson QMLE can be used. This approach is 
completely analogous to pooled probit and pooled Tobit estimation with panel data. 
Note, however, that we are not assuming that the Poisson distribution is true. 

For each i, the quasi-log likelihood for pooled Poisson estimation is (up to additive 
constants) 

$$\ell_i(\beta) = \sum_{t=1}^{T} \{y_{it}\log[m(x_{it}, \beta)] - m(x_{it}, \beta)\} \equiv \sum_{t=1}^{T} \ell_{it}(\beta). \tag{18.69}$$

The pooled Poisson QMLE then maximizes the sum of $\ell_i(\beta)$ across i = 1,...,N. 
Consistency and asymptotic normality of this estimator follow from the Chapter 
12 results, once we use the fact that $\beta_o$ maximizes $E[\ell_i(\beta)]$; this follows from GMT 
(1984a). Thus, pooled Poisson estimation is robust in the sense that it consistently 
estimates $\beta_o$ under assumption (18.68) only. 

Without further assumptions we must be careful in estimating the asymptotic 
variance of $\hat\beta$. Let $s_i(\beta)$ be the P × 1 score of $\ell_i(\beta)$, which can be written as $s_i(\beta) = 
\sum_{t=1}^{T} s_{it}(\beta)$, where $s_{it}(\beta)$ is the score of $\ell_{it}(\beta)$; each $s_{it}(\beta)$ has the form (18.12) but 
with $(x_{it}, y_{it})$ in place of $(x_i, y_i)$. 

The asymptotic variance of $\sqrt{N}(\hat\beta - \beta_o)$ has the usual form $A_o^{-1}B_oA_o^{-1}$, where $A_o \equiv 
\sum_{t=1}^{T} E[\nabla_\beta m_{it}(\beta_o)'\nabla_\beta m_{it}(\beta_o)/m_{it}(\beta_o)]$ and $B_o \equiv E[s_i(\beta_o)s_i(\beta_o)']$. Consistent 
estimators are 

$$\hat A = N^{-1} \sum_{i=1}^{N}\sum_{t=1}^{T} \nabla_\beta \hat m_{it}'\nabla_\beta \hat m_{it}/\hat m_{it} \tag{18.70}$$

$$\hat B = N^{-1} \sum_{i=1}^{N} s_i(\hat\beta)s_i(\hat\beta)', \tag{18.71}$$

and we can use $\hat A^{-1}\hat B\hat A^{-1}/N$ for $\widehat{\mathrm{Avar}}(\hat\beta)$. This procedure is fully robust to the 
presence of serial correlation in the score and arbitrary conditional variances. It should be 
used in the construction of standard errors and Wald statistics. The quasi-LR statistic 
is not usually valid in this setup because of neglected time dependence and possible 
violations of the Poisson variance assumption. 
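
For the exponential mean $m(x_{it},\beta) = \exp(x_{it}\beta)$, the estimators in (18.70) and (18.71) take a simple form, because $\nabla_\beta m_{it}'\nabla_\beta m_{it}/m_{it} = m_{it}x_{it}'x_{it}$ and $s_{it}(\beta) = x_{it}'(y_{it} - m_{it})$. The sketch below is one possible implementation under that assumption; the function name `pooled_poisson`, the Newton iteration, and the simulated panel are hypothetical.

```python
import numpy as np

def pooled_poisson(X, y, ids, tol=1e-10, max_iter=100):
    """Pooled Poisson QMLE with exponential mean.

    X is (NT x P), y is (NT,), ids gives the cross section unit of each row.
    Returns (beta_hat, robust_se), where B_hat uses scores summed within unit,
    so the standard errors allow serial correlation and any conditional variance.
    """
    beta = np.zeros(X.shape[1])
    for _ in range(max_iter):
        m = np.exp(X @ beta)
        score = X.T @ (y - m)
        hess = (X * m[:, None]).T @ X              # minus the Hessian
        step = np.linalg.solve(hess, score)
        beta = beta + step
        if np.max(np.abs(step)) < tol:
            break
    m = np.exp(X @ beta)
    A = (X * m[:, None]).T @ X                     # N * A_hat in (18.70)
    s_it = X * (y - m)[:, None]                    # row-level scores s_it(beta_hat)
    units, inv = np.unique(ids, return_inverse=True)
    s_i = np.zeros((units.size, X.shape[1]))
    np.add.at(s_i, inv, s_it)                      # s_i = sum over t of s_it
    B = s_i.T @ s_i                                # N * B_hat in (18.71)
    A_inv = np.linalg.inv(A)
    V = A_inv @ B @ A_inv                          # equals A_hat^{-1} B_hat A_hat^{-1} / N
    return beta, np.sqrt(np.diag(V))

# Hypothetical panel (assumption): multiplicative heterogeneity independent of x
rng = np.random.default_rng(3)
N, T = 2000, 4
ids = np.repeat(np.arange(N), T)
x = rng.normal(size=N * T)
c = rng.gamma(shape=2.0, scale=0.5, size=N)        # E(c) = 1
X = np.column_stack([np.ones(N * T), x])
y = rng.poisson(c[ids] * np.exp(0.5 + 0.3 * x))

beta_hat, se = pooled_poisson(X, y, ids)
print("estimates:", beta_hat, "robust s.e.:", se)
```

The data here are overdispersed and serially correlated within unit, so the Poisson distribution is false even though the conditional mean is correct; the point estimates remain close to the assumed values, and only the variance calculation needs to be robust.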

If the conditional mean is dynamically complete in the sense that 

$$E(y_t \mid x_t, y_{t-1}, x_{t-1}, \ldots, y_1, x_1) = E(y_t \mid x_t), \tag{18.72}$$

then $\{s_{it}(\beta_o): t = 1,2,\ldots,T\}$ is serially uncorrelated. Consequently, under 
assumption (18.72), a consistent estimator of B is 

$$\hat B = N^{-1} \sum_{i=1}^{N}\sum_{t=1}^{T} s_{it}(\hat\beta)s_{it}(\hat\beta)'. \tag{18.73}$$

Using this equation along with $\hat A$ produces the asymptotic variance that results from 
treating the observations as one long cross section, but without the Poisson or GLM 
variance assumptions. Thus, equation (18.73) affords a certain amount of robustness, 
but it requires the dynamic completeness assumption (18.72). 


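There are many other possibilities. If we impose the GLM assumption 

$$\mathrm{Var}(y_{it} \mid x_{it}) = \sigma_o^2\, m(x_{it}, \beta_o), \qquad t = 1,2,\ldots,T, \tag{18.74}$$

along with dynamic completeness, then $\mathrm{Avar}(\hat\beta)$ can be estimated by 

$$\hat\sigma^2 \left( \sum_{i=1}^{N}\sum_{t=1}^{T} \nabla_\beta \hat m_{it}'\nabla_\beta \hat m_{it}/\hat m_{it} \right)^{-1}, \tag{18.75}$$

where $\hat\sigma^2 = (NT - P)^{-1}\sum_{i=1}^{N}\sum_{t=1}^{T} \tilde u_{it}^2$, $\tilde u_{it} = \hat u_{it}/\sqrt{\hat m_{it}}$, and $\hat u_{it} = y_{it} - m_{it}(\hat\beta)$. This 
estimator results in a standard GLM analysis on the pooled data. 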

A very similar analysis holds for pooled gamma QMLE by simply changing the 
quasi-log likelihood and associated statistics. 


18.7.2 Specifying Models of Conditional Expectations with Unobserved Effects 


We now turn to models that explicitly contain an unobserved effect. The issues that 
arise here are similar to those that arose in linear panel data models. First, we must 
know whether the explanatory variables are strictly exogenous conditional on an 
unobserved effect. Second, we must decide how the unobserved effect should appear 
in the conditional mean. 

Given conditioning variables $x_t$, strict exogeneity conditional on the unobserved 
effect c is defined just as in the linear case: 

$$E(y_t \mid x_1, \ldots, x_T, c) = E(y_t \mid x_t, c). \tag{18.76}$$

As always, this definition rules out lagged values of y in $x_t$, and it can rule out 
feedback from $y_t$ to future explanatory variables. In static models, where $x_t = z_t$ for 
variables $z_t$ dated contemporaneously with $y_t$, assumption (18.76) implies that neither 
past nor future values of z affect the expected value of $y_t$, once $z_t$ and c have been 
controlled for. This can be too restrictive, but it is often the starting point for 
analyzing static models. 

A finite distributed lag relationship assumes that 

$$E(y_t \mid z_t, z_{t-1}, \ldots, z_1, c) = E(y_t \mid z_t, z_{t-1}, \ldots, z_{t-Q}, c), \qquad t > Q, \tag{18.77}$$

where Q is the length of the distributed lag. Under assumption (18.77), the strict 
exogeneity assumption conditional on c becomes 

$$E(y_t \mid z_1, z_2, \ldots, z_T, c) = E(y_t \mid z_1, \ldots, z_t, c), \tag{18.78}$$

which is less restrictive than in the purely static model because lags of $z_t$ explicitly 
appear in the model; it still rules out general feedback from $y_t$ to $(z_{t+1}, \ldots, z_T)$. 

With count variables, a multiplicative unobserved effect is an attractive functional 
form: 

$$E(y_t \mid x_t, c) = c \cdot m(x_t, \beta_o), \tag{18.79}$$

where $m(x_t, \beta)$ is a parametric function known up to the P × 1 vector of parameters 
$\beta_o$. Equation (18.79) implies that the partial effect of $x_t$ on $\log E(y_t \mid x_t, c)$ does not 
depend on the unobserved effect c. Thus, quantities such as elasticities and 
semielasticities depend only on $x_t$ and $\beta_o$. The most popular special case is the exponential 
model $E(y_t \mid x_t, a) = \exp(a + x_t\beta)$, which is obtained by taking $c = \exp(a)$. 
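
To see why the unobserved effect drops out of proportionate effects, one can take logs of (18.79). The short derivation below is a sketch that assumes $c > 0$, $m > 0$, and that $x_{tj}$ (a label introduced here for the jth element of $x_t$) enters $m$ differentiably:

$$\log E(y_t \mid x_t, c) = \log c + \log m(x_t, \beta_o) \quad\Longrightarrow\quad \frac{\partial \log E(y_t \mid x_t, c)}{\partial x_{tj}} = \frac{\partial \log m(x_t, \beta_o)}{\partial x_{tj}},$$

which is free of c. In the exponential case, $m(x_t, \beta) = \exp(x_t\beta)$, the right-hand side is simply $\beta_{oj}$, the semielasticity of the conditional mean with respect to $x_{tj}$.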


18.7.3 Random Effects Methods 


A multiplicative random effects model maintains, at a minimum, two assumptions for 
a random draw i from the population: 

$$E(y_{it} \mid x_{i1}, \ldots, x_{iT}, c_i) = c_i\, m(x_{it}, \beta_o), \qquad t = 1,2,\ldots,T \tag{18.80}$$

$$E(c_i \mid x_{i1}, \ldots, x_{iT}) = E(c_i) = 1, \tag{18.81}$$

where $c_i$ is the unobserved, time-constant effect and the observed explanatory 
variables, $x_{it}$, may be time constant or time varying. Assumption (18.80) is the strict 
exogeneity assumption of the $x_{it}$ conditional on $c_i$, combined with a regression 
function multiplicative in $c_i$. When $y_{it} \ge 0$, such as with a count variable, the most 
popular choice of the parametric regression function is $m(x_t, \beta) = \exp(x_t\beta)$, in which 
case $x_{it}$ would typically contain a full set of time dummies. Assumption (18.81) says 
that the unobserved effect, $c_i$, is mean independent of $x_i$; we normalize the mean to be 
one, a step which is without loss of generality for common choices of m, including the 
exponential function with unity in $x_t$. Under assumptions (18.80) and (18.81), we can 
“integrate out” $c_i$ by using the law of iterated expectations: 

$$E(y_{it} \mid x_i) = E(y_{it} \mid x_{it}) = m(x_{it}, \beta_o), \qquad t = 1,2,\ldots,T. \tag{18.82}$$

Equation (18.82) shows that $\beta_o$ can be consistently estimated by the pooled Poisson 
method discussed in Section 18.7.1. The robust variance matrix estimator that allows 
for an arbitrary conditional variance and serial correlation produces valid inference. 
Just as in a linear random effects model, the presence of the unobserved heterogeneity 
causes the $y_{it}$ to be correlated over time, conditional on $x_i$. 

When we introduce an unobserved effect explicitly, a random effects analysis 
typically accounts for the overdispersion and serial dependence implied by assumptions 
(18.80) and (18.81). For count data, the Poisson random effects model is given by 

$$y_{it} \mid x_i, c_i \sim \mathrm{Poisson}[c_i\, m(x_{it}, \beta_o)] \tag{18.83}$$

$$y_{it}, y_{ir} \text{ are independent conditional on } x_i, c_i, \qquad t \ne r \tag{18.84}$$

$$c_i \text{ is independent of } x_i \text{ and distributed as } \mathrm{Gamma}(\delta_o, \delta_o), \tag{18.85}$$

where we parameterize the gamma distribution so that $E(c_i) = 1$ and $\mathrm{Var}(c_i) = 
1/\delta_o \equiv \eta_o^2$. While $\mathrm{Var}(y_{it} \mid x_i, c_i) = E(y_{it} \mid x_i, c_i)$ under assumption (18.83), by 
equation (18.28), $\mathrm{Var}(y_{it} \mid x_i) = E(y_{it} \mid x_i)[1 + \eta_o^2 E(y_{it} \mid x_i)]$, and so assumptions (18.81) 
and (18.85) imply overdispersion in $\mathrm{Var}(y_{it} \mid x_i)$. Although other distributional 
assumptions for $c_i$ can be used, the gamma distribution leads to a tractable density for 
$(y_{i1}, \ldots, y_{iT})$ given $x_i$, which is obtained after $c_i$ has been integrated out. (See HHG, 
p. 917, and Problem 18.11.) Maximum likelihood analysis (conditional on $x_i$) is 
relatively straightforward and is implemented by some econometrics packages. 

If assumptions (18.83), (18.84), and (18.85) all hold, the conditional MLE is 
efficient among all estimators that do not use information on the distribution of $x_i$; see 
Section 14.5.2. The main drawback with the random effects Poisson model is that it 
is sensitive to violations of the maintained assumptions, any of which could be false. 
(Problem 18.5 covers some ways to allow $c_i$ and $x_i$ to be correlated, but they still rely 
on stronger assumptions than the FE Poisson estimator, which we cover in Section 
18.7.4.) 

A quasi-MLE random effects analysis keeps some of the key features of 
assumptions (18.83)-(18.85) but produces consistent estimators under just the conditional 
mean assumptions (18.80) and (18.81). Nominally, we maintain assumptions (18.83)- 
(18.85). Define $u_{it} \equiv y_{it} - E(y_{it} \mid x_{it}) = y_{it} - m(x_{it}, \beta_o)$. Then we can write $u_{it} = 
c_i m_{it}(\beta_o) + e_{it} - m_{it}(\beta_o) = e_{it} + m_{it}(\beta_o)(c_i - 1)$, where $e_{it} \equiv y_{it} - E(y_{it} \mid x_i, c_i)$. As we 
showed in Section 18.3.1, 

$$E(u_{it}^2 \mid x_i) = m_{it}(\beta_o) + \eta_o^2 m_{it}^2(\beta_o). \tag{18.86}$$

Further, for $t \ne r$, 

$$E(u_{it}u_{ir} \mid x_i) = E[(c_i - 1)^2]\, m_{it}(\beta_o)m_{ir}(\beta_o) = \eta_o^2\, m_{it}(\beta_o)m_{ir}(\beta_o), \tag{18.87}$$

where $\eta_o^2 = \mathrm{Var}(c_i)$. The serial correlation in equation (18.87) is reminiscent of the 
serial correlation that arises in linear random effects models under standard 
assumptions. This shows explicitly that we must correct for serial dependence in computing 
the asymptotic variance of the pooled Poisson QMLE in Section 18.7.1. The 
overdispersion in equation (18.86) is analogous to the variance of the composite error in 
a linear model. A QMLE random effects analysis exploits these nominal variance 
and covariance expressions but does not rely on either of them for consistency. If we 
use equation (18.86) while ignoring equation (18.87), we are led to a pooled negative 
binomial analysis, which is very similar to the pooled Poisson analysis except that the 
quasi-log likelihood for each time period is the negative binomial discussed in Section 
18.3.1. See Wooldridge (1997c) for details. 

If condition (18.87) holds, it is more efficient (perhaps much more efficient) to 
use the weighted multivariate nonlinear least squares estimator (WMNLS) discussed 
in Section 12.9.2. We simply construct an estimate of the conditional variance matrix 
based on (18.86) and (18.87). To implement the method, we would obtain the pooled 
Poisson QMLE of $\beta_o$, say, $\check\beta$. We can use this estimator to estimate $\eta_o^2$. One possibility 
is to note that $E[(u_{it}^2 - m_{it}(\beta_o))/m_{it}(\beta_o) \mid x_i] = \eta_o^2 m_{it}(\beta_o)$. Let $\check u_{it}^2 = (y_{it} - m_{it}(\check\beta))^2$ be 
the squared residuals from the pooled Poisson QMLE. Then obtain $\hat\eta^2$ from the 
pooled simple regression (through the origin) of $[\check u_{it}^2/m_{it}(\check\beta)] - 1$ on $m_{it}(\check\beta)$, t = 1,..., 
T, i = 1,...,N. Now, given $\hat\eta^2$ and $\check\beta$, we can estimate the conditional variance and 
conditional covariances in (18.86) and (18.87). Call the resulting T × T matrix $\hat W_i$. 
Then recall that the WMNLS estimator, $\hat\beta$, solves 

$$\min_{\beta} \sum_{i=1}^{N} [y_i - m_i(\beta)]'\hat W_i^{-1}[y_i - m_i(\beta)].$$

Under assumptions (18.86) and (18.87), the WMNLS estimator is relatively efficient 
among estimators that only require a correct conditional mean for consistency, and 
its asymptotic variance can be estimated as 

$$\widehat{\mathrm{Avar}}(\hat\beta) = \left( \sum_{i=1}^{N} \nabla_\beta \hat m_i'\hat W_i^{-1}\nabla_\beta \hat m_i \right)^{-1}.$$

As with the other QMLEs, the WMNLS estimator is consistent under assumptions 
(18.80) and (18.81) only, but if assumption (18.86) or (18.87) is violated, the variance 
matrix needs to be made robust. Letting $\hat u_i = y_i - m_i(\hat\beta)$ (a T × 1 vector), the robust 
estimator is 

$$\left( \sum_{i=1}^{N} \nabla_\beta \hat m_i'\hat W_i^{-1}\nabla_\beta \hat m_i \right)^{-1} \left( \sum_{i=1}^{N} \nabla_\beta \hat m_i'\hat W_i^{-1}\hat u_i\hat u_i'\hat W_i^{-1}\nabla_\beta \hat m_i \right) \left( \sum_{i=1}^{N} \nabla_\beta \hat m_i'\hat W_i^{-1}\nabla_\beta \hat m_i \right)^{-1}.$$

This expression gives a way to obtain fully robust inference while having a relatively 
efficient estimator under the random effects assumptions (18.86) and (18.87). 
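
The first-stage pieces of the WMNLS procedure (the through-the-origin regression for $\hat\eta^2$ and the working variance matrix implied by (18.86) and (18.87)) can be sketched as follows. The function names and the simulated data are hypothetical, and, to keep the example short, the true mean $m_{it}$ stands in for the fitted values from a first-stage pooled Poisson QMLE.

```python
import numpy as np

def eta2_hat(y, m):
    """Slope from regressing (u^2/m - 1) on m through the origin, pooled over i and t."""
    u2 = (y - m) ** 2
    lhs = u2 / m - 1.0
    return float(np.sum(m * lhs) / np.sum(m ** 2))

def working_variance(m_i, eta2):
    """T x T matrix: diagonal m_it + eta2*m_it^2 (18.86), off-diagonal eta2*m_it*m_ir (18.87)."""
    return np.diag(m_i) + eta2 * np.outer(m_i, m_i)

# Hypothetical data (assumption): Poisson counts with gamma heterogeneity, Var(c) = 0.5
rng = np.random.default_rng(4)
N, T = 50_000, 3
x = rng.normal(size=(N, T))
m = np.exp(0.2 + 0.4 * x)                       # stand-in for first-stage fitted means
c = rng.gamma(shape=2.0, scale=0.5, size=N)     # E(c) = 1, Var(c) = 0.5
y = rng.poisson(c[:, None] * m)

eta2 = eta2_hat(y, m)
W_first = working_variance(m[0], eta2)          # working variance matrix for unit i = 0
print(f"eta2_hat = {eta2:.3f}")
print(W_first)
```

With $\hat W_i$ in hand for every unit, the WMNLS objective is a weighted nonlinear least squares problem; the robust variance expression in the text guards against the working matrix being wrong.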

GMT (1984b) cover a model that suggests an alternative form of $\hat W_i$. The matrix 
$\hat W_i$ can be modified for other nominal distributional assumptions, such as the gamma 
(which would be natural to apply to continuous, nonnegative $y_{it}$). Further, a typical 
generalized estimating equation (GEE) approach would choose a different estimator 
of the variance matrix. The GEE approach associated with the Poisson QMLE would 
be to maintain the nominal Poisson variance assumption, $\mathrm{Var}(y_{it} \mid x_i) = \sigma_o^2 m(x_{it}, \beta_o)$, 
along with a constant working correlation matrix. The resulting variance-covariance 
matrix cannot be derived from an unobserved effects model; generally, it will be 
inefficient under (18.83) to (18.85). See Section 12.9.2 for further discussion on GEE. 

We must remember that none of the suggested WMNLS methods that allow nonzero
correlation is consistent if $E(y_{it} \mid \mathbf{x}_i) \neq m(\mathbf{x}_{it}, \boldsymbol{\beta}_o)$. In the context of an
unobserved effects model, we usually think of a misspecified conditional mean as coming
either from lack of strict exogeneity of $\{\mathbf{x}_{it} : t = 1,\ldots,T\}$ (conditional on $c_i$) or from
correlation between $c_i$ and $\mathbf{x}_i$.


18.7.4 Fixed Effects Poisson Estimation 


HHG first showed how to do an FE type of analysis of count panel data models,
which allows for arbitrary dependence between $c_i$ and $\mathbf{x}_i$. Their FE Poisson
assumptions are (18.83) and (18.84), with the conditional mean given still by assumption
(18.80). The key is that neither assumption (18.85) nor assumption (18.81) is
maintained; in other words, arbitrary dependence between $c_i$ and $\mathbf{x}_i$ is allowed. HHG take
$m(\mathbf{x}_{it}, \boldsymbol{\beta}) = \exp(\mathbf{x}_{it}\boldsymbol{\beta})$, which is by far the leading case.

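HHG use Andersen's (1970) conditional ML methodology to estimate $\boldsymbol{\beta}_o$. Let $n_i \equiv
\sum_{t=1}^{T} y_{it}$ denote the sum across time of the counts. Using standard results on
obtaining a joint distribution conditional on the sum of its components, HHG show
that

$$\mathbf{y}_i \mid n_i, \mathbf{x}_i, c_i \sim \text{Multinomial}\{n_i, p_1(\mathbf{x}_i, \boldsymbol{\beta}_o), \ldots, p_T(\mathbf{x}_i, \boldsymbol{\beta}_o)\}, \tag{18.88}$$

where

$$p_t(\mathbf{x}_i, \boldsymbol{\beta}) \equiv m(\mathbf{x}_{it}, \boldsymbol{\beta}) \Big/ \left[ \sum_{r=1}^{T} m(\mathbf{x}_{ir}, \boldsymbol{\beta}) \right]. \tag{18.89}$$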


Because this distribution does not depend on $c_i$, equation (18.88) is also the
distribution of $\mathbf{y}_i$ conditional on $n_i$ and $\mathbf{x}_i$. Therefore, $\boldsymbol{\beta}_o$ can be estimated by standard
conditional MLE techniques using the multinomial log likelihood. The conditional log
likelihood for observation $i$, apart from terms not depending on $\boldsymbol{\beta}$, is

$$\ell_i(\boldsymbol{\beta}) = \sum_{t=1}^{T} y_{it} \log[p_t(\mathbf{x}_i, \boldsymbol{\beta})]. \tag{18.90}$$

The estimator $\hat{\boldsymbol{\beta}}$ that maximizes $\sum_{i=1}^{N} \ell_i(\boldsymbol{\beta})$ will be called the fixed effects Poisson
(FEP) estimator. (Note that when $y_{it} = 0$ for all $t$, the cross section observation $i$ does
not contribute to the estimation.)
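
As a rough illustration of (18.89) and (18.90) in the leading exponential case, the conditional log likelihood can be coded directly. The function names and array layout below are assumptions made for the sketch, not part of any particular software package.

```python
import numpy as np

def fep_log_probs(beta, x):
    """log p_t(x_i, beta) from (18.89) with m(x_it, beta) = exp(x_it beta).

    x has shape (N, T, K); the result has shape (N, T).
    """
    xb = np.asarray(x, dtype=float) @ np.asarray(beta, dtype=float)
    xb = xb - xb.max(axis=1, keepdims=True)      # stabilise the exponentials
    return xb - np.log(np.exp(xb).sum(axis=1, keepdims=True))

def fep_loglik(beta, y, x):
    """Sum over i of the conditional log likelihood in (18.90)."""
    y = np.asarray(y, dtype=float)
    return float((y * fep_log_probs(beta, x)).sum())
```

Because each $y_{it}$ multiplies $\log p_t$, a unit with all zero outcomes adds exactly zero to the objective, which is the sense in which such units do not contribute to estimation.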

Obtaining the FEP estimator is computationally fairly easy, especially when
$m(\mathbf{x}_{it}, \boldsymbol{\beta}) = \exp(\mathbf{x}_{it}\boldsymbol{\beta})$. But the assumptions used to derive the conditional log
likelihood in equation (18.90) can be restrictive in practice. Fortunately, the FEP estimator
has very strong robustness properties for estimating the parameters in the conditional
mean. As shown in Wooldridge (1999a), the FEP estimator is consistent for $\boldsymbol{\beta}_o$ under
the conditional mean assumption (18.80) only. Except for the conditional mean, the
distribution of $y_{it}$ given $(\mathbf{x}_i, c_i)$ is entirely unrestricted; in particular, there can be
overdispersion or underdispersion in the latent variable model. The distribution of $y_{it}$
need not be discrete; it could be continuous or have discrete and continuous features.
Also, there is no restriction on the dependence between $y_{it}$ and $y_{ir}$, $t \neq r$. This is
another case where the QMLE derived under fairly strong nominal assumptions turns
out to have very desirable robustness properties.

The argument that the FEP estimator is consistent under assumption (18.80)
hinges on showing that $\boldsymbol{\beta}_o$ maximizes the expected value of equation (18.90) under
assumption (18.80) only. This result is shown in Wooldridge (1999a). Uniqueness
holds under general identification assumptions, but certain kinds of explanatory
variables are ruled out. For example, when the conditional mean has an exponential
form, it is easy to see that the coefficients on time-constant explanatory variables
drop out of equation (18.89), just as in the linear case. Interactions between
time-constant and time-varying explanatory variables are allowed.
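
To see the cancellation explicitly, suppose (as an illustrative assumption) that the exponential mean includes a time-constant row vector $\mathbf{z}_i$ with coefficient $\boldsymbol{\gamma}$; this notation is introduced only for the sketch. Then the probabilities in (18.89) become

```latex
p_t(\mathbf{x}_i, \mathbf{z}_i, \boldsymbol{\beta}, \boldsymbol{\gamma})
  = \frac{\exp(\mathbf{x}_{it}\boldsymbol{\beta} + \mathbf{z}_i\boldsymbol{\gamma})}
         {\sum_{r=1}^{T} \exp(\mathbf{x}_{ir}\boldsymbol{\beta} + \mathbf{z}_i\boldsymbol{\gamma})}
  = \frac{\exp(\mathbf{z}_i\boldsymbol{\gamma})\exp(\mathbf{x}_{it}\boldsymbol{\beta})}
         {\exp(\mathbf{z}_i\boldsymbol{\gamma})\sum_{r=1}^{T} \exp(\mathbf{x}_{ir}\boldsymbol{\beta})}
  = \frac{\exp(\mathbf{x}_{it}\boldsymbol{\beta})}
         {\sum_{r=1}^{T} \exp(\mathbf{x}_{ir}\boldsymbol{\beta})}.
```

The factor $\exp(\mathbf{z}_i\boldsymbol{\gamma})$ plays the same role as $c_i$ and cancels, so $\boldsymbol{\gamma}$ cannot be identified from (18.90). An interaction such as $z_i x_{itj}$ varies with $t$, so it does not factor out of the sum and its coefficient remains in the probabilities.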

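Consistent estimation of the asymptotic variance of $\hat{\boldsymbol{\beta}}$ follows from the results on
M-estimation in Chapter 12. The score for observation $i$ can be written as

$$\mathbf{s}_i(\boldsymbol{\beta}) \equiv \nabla_{\beta}\ell_i(\boldsymbol{\beta})' = \sum_{t=1}^{T} y_{it} [\nabla_{\beta}p_t(\mathbf{x}_i, \boldsymbol{\beta})' / p_t(\mathbf{x}_i, \boldsymbol{\beta})]$$

$$= \nabla_{\beta}\mathbf{p}(\mathbf{x}_i, \boldsymbol{\beta})' \mathbf{W}(\mathbf{x}_i, \boldsymbol{\beta}) \{\mathbf{y}_i - \mathbf{p}(\mathbf{x}_i, \boldsymbol{\beta}) n_i\}, \tag{18.91}$$

where $\mathbf{W}(\mathbf{x}_i, \boldsymbol{\beta}) \equiv [\text{diag}\{p_1(\mathbf{x}_i, \boldsymbol{\beta}), \ldots, p_T(\mathbf{x}_i, \boldsymbol{\beta})\}]^{-1}$, $\mathbf{u}_i(\boldsymbol{\beta}) \equiv \mathbf{y}_i - \mathbf{p}(\mathbf{x}_i, \boldsymbol{\beta}) n_i$, $\mathbf{p}(\mathbf{x}_i, \boldsymbol{\beta})
\equiv [p_1(\mathbf{x}_i, \boldsymbol{\beta}), \ldots, p_T(\mathbf{x}_i, \boldsymbol{\beta})]'$, and $p_t(\mathbf{x}_i, \boldsymbol{\beta})$ is given by equation (18.89).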
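The expected Hessian for observation $i$ can be shown to be

$$\mathbf{A}_o \equiv E[n_i \nabla_{\beta}\mathbf{p}(\mathbf{x}_i, \boldsymbol{\beta}_o)' \mathbf{W}(\mathbf{x}_i, \boldsymbol{\beta}_o) \nabla_{\beta}\mathbf{p}(\mathbf{x}_i, \boldsymbol{\beta}_o)].$$

The asymptotic variance of $\hat{\boldsymbol{\beta}}$ is $\mathbf{A}_o^{-1}\mathbf{B}_o\mathbf{A}_o^{-1}/N$, where $\mathbf{B}_o \equiv E[\mathbf{s}_i(\boldsymbol{\beta}_o)\mathbf{s}_i(\boldsymbol{\beta}_o)']$. A
consistent estimate of $\mathbf{A}_o$ is

$$\hat{\mathbf{A}} = N^{-1} \sum_{i=1}^{N} n_i \nabla_{\beta}\mathbf{p}(\mathbf{x}_i, \hat{\boldsymbol{\beta}})' \mathbf{W}(\mathbf{x}_i, \hat{\boldsymbol{\beta}}) \nabla_{\beta}\mathbf{p}(\mathbf{x}_i, \hat{\boldsymbol{\beta}}), \tag{18.92}$$

and $\mathbf{B}_o$ is estimated as

$$\hat{\mathbf{B}} = N^{-1} \sum_{i=1}^{N} \mathbf{s}_i(\hat{\boldsymbol{\beta}})\mathbf{s}_i(\hat{\boldsymbol{\beta}})'. \tag{18.93}$$

The robust variance matrix estimator, $\hat{\mathbf{A}}^{-1}\hat{\mathbf{B}}\hat{\mathbf{A}}^{-1}/N$, is valid under assumption
(18.80); in particular, it allows for any deviations from the Poisson distribution and
arbitrary time dependence. The usual ML estimate, $\hat{\mathbf{A}}^{-1}/N$, is valid under
assumptions (18.83) and (18.84). For more details, including methods for specification
testing, see Wooldridge (1999a).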
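
A minimal sketch of how the FEP estimate and the two variance estimators might be computed with an exponential mean follows. It relies on two simplifications that hold in that case: $\nabla_{\beta}p_t = p_t(\mathbf{x}_{it} - \sum_r p_r \mathbf{x}_{ir})$, and the score reduces to $\sum_t \mathbf{x}_{it}'(y_{it} - n_i p_t)$. The function name `fep_fit`, the Newton iteration, and the array layout are illustrative choices, not a description of any existing routine.

```python
import numpy as np

def fep_fit(y, x, iters=25):
    """FEP estimator with exponential mean, plus usual and robust variances.

    y: (N, T) nonnegative outcomes; x: (N, T, K) time-varying regressors.
    Returns (beta_hat, avar_usual, avar_robust, scores).
    """
    y = np.asarray(y, dtype=float)
    x = np.asarray(x, dtype=float)
    N, T, K = x.shape
    n = y.sum(axis=1)                                    # n_i

    def pieces(beta):
        xb = x @ beta
        xb = xb - xb.max(axis=1, keepdims=True)
        p = np.exp(xb)
        p = p / p.sum(axis=1, keepdims=True)             # p_t(x_i, beta)
        # Scores s_i = sum_t x_it' (y_it - n_i p_it), stored as (N, K).
        scores = np.einsum("ntk,nt->nk", x, y - n[:, None] * p)
        # A = N^{-1} sum_i n_i sum_t p_it (x_it - xbar_i)'(x_it - xbar_i).
        xbar = np.einsum("nt,ntk->nk", p, x)
        xc = x - xbar[:, None, :]
        A = np.einsum("n,nt,ntk,ntj->kj", n, p, xc, xc) / N
        return scores, A

    beta = np.zeros(K)
    for _ in range(iters):
        scores, A = pieces(beta)
        step = np.linalg.solve(A, scores.mean(axis=0))   # Newton step
        beta = beta + step
        if np.abs(step).max() < 1e-12:
            break

    scores, A = pieces(beta)
    B = scores.T @ scores / N                            # as in (18.93)
    A_inv = np.linalg.inv(A)
    return beta, A_inv / N, A_inv @ B @ A_inv / N, scores
```

The third returned matrix is the sandwich form that remains valid under (18.80) alone, so comparing its diagonal with that of the second matrix gives a quick sense of how much the nominal Poisson assumptions matter in a given data set.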

Applications of the FEP estimator, which compute the robust variance matrix and
some specification test statistics, are given in Papke (1991), Page (1995), and Gordy
(1999). We must emphasize that, while the leading application is to count data, the
FEP estimator works whenever assumption (18.80) holds. Therefore, $y_{it}$ could be a
nonnegative continuous variable, or even a binary response if we believe the
unobserved effect is multiplicative (in contrast to the models in Sections 15.8.2 and 15.8.3).

Because the FEP estimator relies heavily on strict exogeneity of $\{\mathbf{x}_{it} : t = 1,\ldots,T\}$,
conditional on $c_i$, it is helpful to have a simple test of this assumption. A
straightforward approach is to simply add $\mathbf{w}_{i,t+1}$ to the model, usually an exponential
model, where $\mathbf{w}_{it} \subset \mathbf{x}_{it}$. Then, using the FEP estimator, a fully robust joint significance
test for $\mathbf{w}_{i,t+1}$ can be used. A significant statistic indicates that the strict exogeneity
assumption fails. Naturally, we lose the last time period in carrying out the test. One
could even interact $\mathbf{w}_{i,t+1}$ with certain elements of $\mathbf{x}_{it}$ and include these interaction
terms in the joint test.
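
Constructing the lead variables for this test is mostly a matter of bookkeeping. The helper below is a hypothetical sketch that appends $\mathbf{w}_{i,t+1}$ to $\mathbf{x}_{it}$ and drops the final period; the name `add_lead` and the balanced-panel array layout are assumptions.

```python
import numpy as np

def add_lead(x, w_cols):
    """Append w_{i,t+1} to x_it and drop the final period.

    x: (N, T, K) regressors; w_cols: indices of the columns of x forming w_it.
    Returns an array of shape (N, T-1, K + len(w_cols)).
    """
    x = np.asarray(x, dtype=float)
    lead = x[:, 1:, :][:, :, list(w_cols)]       # w_{i,t+1} for t = 1, ..., T-1
    return np.concatenate([x[:, :-1, :], lead], axis=2)
```

The outcome array must be trimmed in the same way (for example `y[:, :-1]`) before the augmented model is estimated, and the joint test is then applied to the coefficients on the appended columns.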


18.7.5 Relaxing the Strict Exogeneity Assumption 


If the test in the previous subsection rejects, we probably need to relax the strict
exogeneity assumption. In place of assumption (18.80) we assume

$$E(y_{it} \mid \mathbf{x}_{i1}, \ldots, \mathbf{x}_{it}, c_i) = c_i m(\mathbf{x}_{it}, \boldsymbol{\beta}_o), \qquad t = 1, \ldots, T. \tag{18.94}$$

These are sequential moment restrictions of the kind we discussed in Chapter 11. The
model (18.94) is applicable to static and distributed lag models with possible
feedback, as well as to models with lagged dependent variables. Again, $y_{it}$ need not be a
count variable here.

Chamberlain (1992b) and Wooldridge (1997a) have suggested residual functions
that lead to conditional moment restrictions. Assuming that $m(\mathbf{x}_{it}, \boldsymbol{\beta}) > 0$, define

$$r_{it}(\boldsymbol{\beta}) \equiv y_{it} - y_{i,t+1}[m(\mathbf{x}_{it}, \boldsymbol{\beta})/m(\mathbf{x}_{i,t+1}, \boldsymbol{\beta})], \qquad t = 1, \ldots, T-1. \tag{18.95}$$

Under assumption (18.94), we can use iterated expectations to show that
$E[r_{it}(\boldsymbol{\beta}_o) \mid \mathbf{x}_{i1}, \ldots, \mathbf{x}_{it}] = 0$. This expression means that any function of $\mathbf{x}_{i1}, \ldots, \mathbf{x}_{it}$ is
uncorrelated with $r_{it}(\boldsymbol{\beta}_o)$ and is the basis for GMM estimation. One can easily test
the strict exogeneity assumption in a GMM framework. For further discussion and
details on implementation, as well as an alternative residual function, see Wooldridge
(1997a).
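
With an exponential mean the ratio in (18.95) simplifies to $\exp[(\mathbf{x}_{it} - \mathbf{x}_{i,t+1})\boldsymbol{\beta}]$, so the residual function is easy to compute. The sketch below only builds the residuals; the choice of instruments and the GMM weighting are left out, and the function name is hypothetical.

```python
import numpy as np

def quasi_diff_residuals(beta, y, x):
    """r_it(beta) = y_it - y_{i,t+1} * exp((x_it - x_{i,t+1}) beta), t = 1..T-1.

    y: (N, T) outcomes; x: (N, T, K) regressors; returns an (N, T-1) array.
    """
    y = np.asarray(y, dtype=float)
    x = np.asarray(x, dtype=float)
    beta = np.asarray(beta, dtype=float)
    ratio = np.exp((x[:, :-1, :] - x[:, 1:, :]) @ beta)   # m_it / m_{i,t+1}
    return y[:, :-1] - y[:, 1:] * ratio
```

Sample moment conditions are then formed by multiplying each $r_{it}$ by functions of $(\mathbf{x}_{i1}, \ldots, \mathbf{x}_{it})$ and averaging across $i$. Note that the multiplicative effect $c_i$ never has to be estimated, because it is removed by the quasi-differencing.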

Blundell, Griffith, and Windmeijer (1998) consider variants of moment conditions
in a linear feedback model, where the mean function contains a lagged dependent
variable, which enters additively, in addition to an exponential regression function in
other conditioning variables with a multiplicative unobserved effect. They apply their
model to the patents and R&D relationship.

A different approach is conditional maximum likelihood, as we discussed in
Sections 15.8.4 and 17.8.3; see Section 13.9 for a general discussion. For example, if
we want to estimate a model for $y_{it}$ given $(\mathbf{z}_{it}, y_{i,t-1}, c_i)$, where $\mathbf{z}_{it}$ contains
contemporaneous variables, we can model it as a Poisson variable with exponential mean
$c_i \exp(\mathbf{z}_{it}\boldsymbol{\beta}_o + \rho_o y_{i,t-1})$. Then, assuming that $D(y_{it} \mid \mathbf{z}_i, y_{i,t-1}, \ldots, y_{i0}, c_i) = D(y_{it} \mid \mathbf{z}_{it},
y_{i,t-1}, c_i)$, we can obtain the density of $(y_{i1}, \ldots, y_{iT})$ given $(y_{i0}, \mathbf{z}_i, c_i)$ by
multiplication; see equation (13.60). Given a density specification for $D(c_i \mid y_{i0}, \mathbf{z}_i)$, we can
obtain the conditional log likelihood for each $i$ as in equation (13.62). A very
convenient specification is $c_i = \exp(\alpha_o + \xi_o y_{i0} + \mathbf{z}_i\boldsymbol{\gamma}_o) a_i$, where $a_i$ is independent of $(y_{i0}, \mathbf{z}_i)$
and distributed as Gamma$(\delta_o, \delta_o)$. Then, for each $t$, $y_{it}$ given $(y_{i,t-1}, \ldots, y_{i0}, \mathbf{z}_i, a_i)$
has a Poisson distribution with mean

$$a_i \exp(\alpha_o + \mathbf{z}_{it}\boldsymbol{\beta}_o + \rho_o y_{i,t-1} + \xi_o y_{i0} + \mathbf{z}_i\boldsymbol{\gamma}_o).$$

(As always, we would probably want aggregate time dummies included in this
equation.) It is easy to see that the distribution of $(y_{i1}, \ldots, y_{iT})$ given $(y_{i0}, \mathbf{z}_i)$ has the
random effects Poisson form with gamma heterogeneity; therefore, standard random
effects Poisson software can be used to estimate $\alpha_o$, $\boldsymbol{\beta}_o$, $\rho_o$, $\xi_o$, $\boldsymbol{\gamma}_o$, and $\delta_o$. The usual
conditional MLE standard errors, $t$ statistics, Wald statistics, and LR statistics are
asymptotically valid for large $N$. See Wooldridge (2005b) for further details.
With an exponential mean, Windmeijer (2000) shows how to modify the moment
conditions in (18.95) to estimate models with contemporaneously endogenous
explanatory variables. Alternatively, in some cases a control function method would be
available. For example, if we start with $E(y_{it1} \mid \mathbf{z}_i, y_{it2}, c_{i1}, r_{it1}) = \exp(\mathbf{z}_{it1}\boldsymbol{\delta}_1 + \alpha_1 y_{it2} +
c_{i1} + r_{it1})$, where $c_{i1}$ is unobserved heterogeneity and $r_{it1}$ is a time-varying omitted
variable, then for continuous $y_{it2}$ we might specify $y_{it2} = \psi_2 + \mathbf{z}_{it}\boldsymbol{\pi}_2 + \bar{\mathbf{z}}_i\boldsymbol{\xi}_2 + v_{it2}$,
where we have imposed the Chamberlain-Mundlak device to allow heterogeneity
affecting $y_{it2}$ to be correlated with $\mathbf{z}_i$ through the time average, $\bar{\mathbf{z}}_i$. Further, if
we specify $c_{i1} = \psi_1 + \bar{\mathbf{z}}_i\boldsymbol{\xi}_1 + a_{i1}$, then we can write $E(y_{it1} \mid \mathbf{z}_i, y_{it2}, v_{it1}) = \exp(\psi_1 +
\mathbf{z}_{it1}\boldsymbol{\delta}_1 + \bar{\mathbf{z}}_i\boldsymbol{\xi}_1 + \alpha_1 y_{it2} + v_{it1})$, where $v_{it1} = a_{i1} + r_{it1}$. While it is hardly general, it
is not unreasonable to assume $(v_{it1}, v_{it2})$ is independent of $\mathbf{z}_i$. If we specify
$E[\exp(v_{it1}) \mid v_{it2}] = \exp(\eta_1 + \rho_1 v_{it2})$ (as would be true under joint normality), we
obtain the estimating equation

$$E(y_{it1} \mid \mathbf{z}_i, y_{it2}, v_{it2}) = \exp[(\psi_1 + \eta_1) + \mathbf{z}_{it1}\boldsymbol{\delta}_1 + \alpha_1 y_{it2} + \bar{\mathbf{z}}_i\boldsymbol{\xi}_1 + \rho_1 v_{it2}]. \tag{18.96}$$


A two-step method is the following: (1) Obtain the residuals $\hat{v}_{it2}$ from the pooled OLS
estimation $y_{it2}$ on $1$, $\mathbf{z}_{it}$, $\bar{\mathbf{z}}_i$ across $t$ and $i$. (2) Use a pooled NLS or QMLE (perhaps
the Poisson) to estimate the exponential function, where $(\bar{\mathbf{z}}_i, \hat{v}_{it2})$ are explanatory
variables along with $(\mathbf{z}_{it1}, y_{it2})$. (As usual, a full set of time period dummies is a good
idea in the first and second steps.) Alternatively, WMNLS or GMM can be used.
One should adjust for the first-stage estimation, using the delta method or possibly
the panel bootstrap, unless $\rho_1 = 0$. Rather than just obtain coefficient estimates, the
APEs can be obtained by averaging the exponential function across $(\bar{\mathbf{z}}_i, \hat{v}_{it2})$. The
details are quite similar to the probit case in Section 15.8.5.
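
The two steps can be sketched as follows, under the assumption that the data have already been stacked into pooled arrays with one row per $(i, t)$ pair. The function names, the backtracking Newton iteration, and the starting values are illustrative choices; the sketch returns point estimates only and does not adjust the standard errors for the first-stage estimation.

```python
import numpy as np

def pooled_ols_residuals(y2, Z):
    """Step 1: residuals from pooled OLS of y_it2 on the columns of Z."""
    coef, *_ = np.linalg.lstsq(Z, y2, rcond=None)
    return y2 - Z @ coef

def poisson_qmle(y1, X, iters=100):
    """Step 2: pooled Poisson QMLE with exponential mean exp(X b)."""
    def loglik(b):
        eta = X @ b
        return float((y1 * eta - np.exp(eta)).sum())

    # Start from a log-linear OLS fit to keep the first steps stable.
    b, *_ = np.linalg.lstsq(X, np.log(y1 + 1.0), rcond=None)
    for _ in range(iters):
        mu = np.exp(X @ b)
        grad = X.T @ (y1 - mu)
        hess = X.T @ (X * mu[:, None])
        step = np.linalg.solve(hess, grad)
        # Halve the step until the quasi-log likelihood does not decrease.
        for _ in range(30):
            if loglik(b + step) >= loglik(b):
                break
            step = step / 2.0
        b = b + step
        if np.abs(step).max() < 1e-10:
            break
    return b
```

In use, `Z` would hold a constant, $\mathbf{z}_{it}$, and $\bar{\mathbf{z}}_i$, while `X` would hold a constant, $\mathbf{z}_{it1}$, $y_{it2}$, $\bar{\mathbf{z}}_i$, and the first-step residuals. At least one element of $\mathbf{z}_{it}$ must be excluded from $\mathbf{z}_{it1}$; otherwise $y_{it2}$ is an exact linear combination of the other columns of `X`.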

Terza's (1998) approach when $y_{it2}$ is binary can be modified along similar lines
using the Chamberlain-Mundlak device. Once one specifies a probit model of the
form $y_{it2} = 1[\psi_2 + \mathbf{z}_{it}\boldsymbol{\pi}_2 + \bar{\mathbf{z}}_i\boldsymbol{\xi}_2 + v_{it2} \geq 0]$, where $v_{it2}$ is independent of $\mathbf{z}_i$ with a
standard normal distribution, and combines this with (18.96), obtaining pooled
estimation methods or GMM methods is straightforward.


18.7.6 Fractional Response Models for Panel Data 


We can also specify and estimate models with unobserved heterogeneity for
fractional response variables. Following Papke and Wooldridge (2008), and for reasons
similar to those in Section 18.6.2, it is easiest to work with the probit response
function, as specified in

$$E(y_{it} \mid \mathbf{x}_{it}, c_i) = \Phi(\mathbf{x}_{it}\boldsymbol{\beta} + c_i), \qquad t = 1, \ldots, T. \tag{18.97}$$

The APEs that we are interested in are just as in the probit case, except that these are
partial effects on a mean response (not a probability).
Without further assumptions, neither $\boldsymbol{\beta}$ nor the APEs are known to be identified.
As with previous nonlinear models, a strict exogeneity assumption, conditional on
the unobserved effect, is useful:

$$E(y_{it} \mid \mathbf{x}_i, c_i) = E(y_{it} \mid \mathbf{x}_{it}, c_i), \qquad t = 1, \ldots, T, \tag{18.98}$$

where $\mathbf{x}_i \equiv (\mathbf{x}_{i1}, \ldots, \mathbf{x}_{iT})$ is the set of covariates in all time periods. A second useful
assumption (which could be made more flexible) is conditional normality using the
Chamberlain-Mundlak approach:

$$c_i \mid (\mathbf{x}_{i1}, \mathbf{x}_{i2}, \ldots, \mathbf{x}_{iT}) \sim \text{Normal}(\psi + \bar{\mathbf{x}}_i\boldsymbol{\xi}, \sigma_a^2), \tag{18.99}$$

where, as always, $\bar{\mathbf{x}}_i$ is the $1 \times K$ vector of time averages. For some purposes, it is
useful to write $c_i = \psi + \bar{\mathbf{x}}_i\boldsymbol{\xi} + a_i$, where $a_i \mid \mathbf{x}_i \sim \text{Normal}(0, \sigma_a^2)$. (Note that $\sigma_a^2 =
\text{Var}(c_i \mid \mathbf{x}_i)$, the conditional variance of $c_i$.) Naturally, if we include time-period
dummies in $\mathbf{x}_{it}$, as is usually desirable, we do not include the time averages of these in $\bar{\mathbf{x}}_i$.
Also, we may include time-constant variables in $\mathbf{x}_{it}$ (omitting them from $\bar{\mathbf{x}}_i$), provided
we understand that we may not be consistently estimating their partial effects.

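Assumptions (18.97), (18.98), and (18.99) impose no additional distributional
assumptions on $D(y_{it} \mid \mathbf{x}_i, c_i)$, and they place no restrictions on the serial dependence
in $\{y_{it}\}$ across time. Nevertheless, the elements of $\boldsymbol{\beta}$ are easily shown to be identified
up to a positive scale factor, and the APEs are identified. A simple way to establish
identification is to write

$$E(y_{it} \mid \mathbf{x}_i, a_i) = \Phi(\psi + \mathbf{x}_{it}\boldsymbol{\beta} + \bar{\mathbf{x}}_i\boldsymbol{\xi} + a_i), \tag{18.100}$$

and so

$$E(y_{it} \mid \mathbf{x}_i) = E[\Phi(\psi + \mathbf{x}_{it}\boldsymbol{\beta} + \bar{\mathbf{x}}_i\boldsymbol{\xi} + a_i) \mid \mathbf{x}_i] = \Phi[(\psi + \mathbf{x}_{it}\boldsymbol{\beta} + \bar{\mathbf{x}}_i\boldsymbol{\xi})/(1 + \sigma_a^2)^{1/2}]$$

or

$$E(y_{it} \mid \mathbf{x}_i) \equiv \Phi(\psi_a + \mathbf{x}_{it}\boldsymbol{\beta}_a + \bar{\mathbf{x}}_i\boldsymbol{\xi}_a), \tag{18.101}$$

where the $a$ subscript denotes division of the original coefficient by $(1 + \sigma_a^2)^{1/2}$.
Because we observe a random sample on $(y_{it}, \mathbf{x}_{it}, \bar{\mathbf{x}}_i)$, (18.101) implies that the scaled
coefficients, $\psi_a$, $\boldsymbol{\beta}_a$, and $\boldsymbol{\xi}_a$, are identified, provided there are no perfect linear
relationships among the elements of $\mathbf{x}_{it}$ and that there is some time variation in all
elements of $\mathbf{x}_{it}$. (The latter requirement ensures that $\mathbf{x}_{it}$ and $\bar{\mathbf{x}}_i$ are not perfectly collinear
for all $t$.) In addition, it follows from the same arguments in Section 15.8.2 that the
average structural function is

$$E_{\bar{\mathbf{x}}_i}[\Phi(\psi_a + \mathbf{x}_t\boldsymbol{\beta}_a + \bar{\mathbf{x}}_i\boldsymbol{\xi}_a)] \tag{18.102}$$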
with respect to the elements of $\mathbf{x}_t$. A consistent estimator, for given $\mathbf{x}_t$, is

$$\widehat{\text{ASF}}(\mathbf{x}_t) = N^{-1} \sum_{i=1}^{N} \Phi(\hat{\psi}_a + \mathbf{x}_t\hat{\boldsymbol{\beta}}_a + \bar{\mathbf{x}}_i\hat{\boldsymbol{\xi}}_a), \tag{18.103}$$

where $\hat{\boldsymbol{\beta}}_a$ is consistent for $\boldsymbol{\beta}_a$, and so on. APEs are obtained by differentiating or
taking differences with respect to elements of $\mathbf{x}_t$. The panel data bootstrap is
particularly convenient for obtaining standard errors or confidence intervals for the APEs.
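
Given estimates of the scaled coefficients, the estimator in (18.103) and the APE for a continuous regressor are simple averages. The sketch below assumes the estimates are already available as arrays; the function names are hypothetical, and the APE formula is just the derivative of (18.103) with respect to one element of $\mathbf{x}_t$.

```python
import math
import numpy as np

def norm_cdf(v):
    """Standard normal cdf applied elementwise."""
    v = np.asarray(v, dtype=float)
    return 0.5 * (1.0 + np.vectorize(math.erf)(v / math.sqrt(2.0)))

def _index(x_t, psi_a, beta_a, xi_a, xbar):
    # psi_a + x_t beta_a + xbar_i xi_a for every i; xbar has shape (N, K).
    return psi_a + float(np.dot(x_t, beta_a)) + np.asarray(xbar, dtype=float) @ np.asarray(xi_a, dtype=float)

def asf_hat(x_t, psi_a, beta_a, xi_a, xbar):
    """Sample analogue of the ASF: average of Phi(index_i) over i."""
    return float(norm_cdf(_index(x_t, psi_a, beta_a, xi_a, xbar)).mean())

def ape_continuous(x_t, j, psi_a, beta_a, xi_a, xbar):
    """APE of continuous x_tj: beta_aj times the average normal density."""
    idx = _index(x_t, psi_a, beta_a, xi_a, xbar)
    pdf = np.exp(-0.5 * idx ** 2) / math.sqrt(2.0 * math.pi)
    return float(beta_a[j] * pdf.mean())
```

For a discrete change in $\mathbf{x}_t$, one would instead take the difference of `asf_hat` at the two values. Bootstrap standard errors would resample cross section units and repeat both the estimation and this averaging step.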

The simplest $\sqrt{N}$-consistent, asymptotically normal estimators are just the pooled
Bernoulli quasi-MLEs, where the explanatory variables are a constant, a full set of
time dummies (probably), $\mathbf{x}_{it}$, and $\bar{\mathbf{x}}_i$. Alternatively, we could also use pooled NLS.
In either case, fully robust inference should be used because the variance associated
with the Bernoulli distribution is likely to be wrong, and the variance is unlikely to be
constant. More important, there is neglected serial correlation.
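
Setting up the regressors for the pooled estimation is the only nonstandard step. A hypothetical helper for a balanced panel, assuming the time dummies are generated separately from $\mathbf{x}_{it}$ so that their time averages are not included, might look like this:

```python
import numpy as np

def mundlak_design(x):
    """Stack [1, time dummies for t >= 2, x_it, xbar_i] into an (N*T, .) matrix.

    x: (N, T, K) array of time-varying covariates (no time dummies inside).
    Rows are ordered unit by unit: (i=1,t=1), ..., (i=1,t=T), (i=2,t=1), ...
    """
    x = np.asarray(x, dtype=float)
    N, T, K = x.shape
    xbar = np.repeat(x.mean(axis=1, keepdims=True), T, axis=1)   # (N, T, K)
    dummies = np.tile(np.eye(T)[:, 1:], (N, 1))                  # (N*T, T-1)
    const = np.ones((N * T, 1))
    return np.hstack([const, dummies,
                      x.reshape(N * T, K), xbar.reshape(N * T, K)])
```

The resulting matrix can be passed to any routine that computes the Bernoulli QMLE with a probit mean, provided inference is clustered at the cross section unit so that it is robust to serial correlation.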

Perhaps more efficient estimators can be obtained using the GEE approach as
described in Section 12.9.2; see also Section 15.8.2. Papke and Wooldridge (2008)
describe implementation in the current setup. An even better strategy is to use a
minimum distance approach as described in Section 14.6.2. (And, of course, for
additional flexibility, we can replace $\bar{\mathbf{x}}_i\boldsymbol{\xi}_a$ with $\mathbf{x}_i\boldsymbol{\xi}_a$.)

If some elements of $\mathbf{x}_{it}$ are not strictly exogenous, a control function method can be
combined with the Chamberlain-Mundlak device. Papke and Wooldridge (2008)
show how the same control function estimator described in the binary response
probit model in Section 15.8.5 applies to fractional responses with a continuous
endogenous explanatory variable. (The consistency of the estimator again relies on the
Bernoulli distribution being in the linear exponential family.) Briefly, if we assume
that the reduced form of the endogenous explanatory variable, $y_{it2}$, can be expressed
as $y_{it2} = \mathbf{z}_{it}\boldsymbol{\delta}_2 + \psi_2 + \bar{\mathbf{z}}_i\boldsymbol{\xi}_2 + v_{it2}$ (where we impose the Chamberlain-Mundlak
device), and we assume joint normality of all unobservables, including $v_{it2}$, then we
can derive an estimating equation of the form

$$E(y_{it1} \mid \mathbf{z}_i, y_{it2}, v_{it2}) = \Phi(\alpha_{e1} y_{it2} + \mathbf{z}_{it1}\boldsymbol{\delta}_{e1} + \psi_{e1} + \bar{\mathbf{z}}_i\boldsymbol{\xi}_{e1} + \eta_{e1} v_{it2}),$$

where the "e" subscript indicates the original parameters have been scaled, similar
to the development in Section 18.6.2. As shown by Papke and Wooldridge (2008),
these scaled coefficients appear in the APEs. Therefore, after obtaining the residuals
$\hat{v}_{it2}$, these can be inserted in the pooled "probit" estimation in a second step to obtain
consistent, $\sqrt{N}$-asymptotically normal estimators. The APEs with respect to $(y_{t2}, \mathbf{z}_{t1})$
are obtained by averaging the derivatives or changes across $(\bar{\mathbf{z}}_i, \hat{v}_{it2})$. Standard errors
are easily obtained using the panel data bootstrap. See Papke and Wooldridge (2008)
for more discussion.


Problems 


18.1. a. For estimating the mean of a nonnegative random variable $y$, the Poisson
quasi-log likelihood for a random draw is

$$\ell_i(\mu) = y_i \log(\mu) - \mu, \qquad \mu > 0$$

(where terms not depending on $\mu$ have been dropped). Letting $\mu_o \equiv E(y_i)$, we have
$E[\ell_i(\mu)] = \mu_o \log(\mu) - \mu$. Show that this function is uniquely maximized at $\mu = \mu_o$.
This simple result is the basis for the consistency of the Poisson QMLE in the general
case.

b. The gamma (exponential) quasi-log likelihood is

$$\ell_i(\mu) = -y_i/\mu - \log(\mu), \qquad \mu > 0.$$

Show that $E[\ell_i(\mu)]$ is uniquely maximized at $\mu = \mu_o$.


18.2. Carefully write out the robust variance matrix estimator (18.14) when
$m(\mathbf{x}, \boldsymbol{\beta}) = \exp(\mathbf{x}\boldsymbol{\beta})$.

18.3. Use the data in SMOKE.RAW to answer this question. 


a. Use a linear regression model to explain cigs, the number of cigarettes smoked 
per day. Use as explanatory variables log(cigpric), log(income), restaurn, white, 
educ, age, and age². Are the price and income variables significant? Does using 
heteroskedasticity-robust standard errors change your conclusions? 


b. Now estimate a Poisson regression model for cigs, with an exponential conditional 
mean and the same explanatory variables as in part a. Using the usual MLE standard 
errors, are the price and income variables each significant at the 5 percent level? In- 
terpret their coefficients. 


c. Find $\hat{\sigma}$. Is there evidence of overdispersion? Using the GLM standard errors,
discuss the significance of log(cigpric) and log(income).

d. Compare the usual MLE LR statistic for joint significance of log(cigpric) and 
log(income) with the QLR statistic in equation (18.17). 

e. Compute the fully robust standard errors, and compare these with the GLM 
standard errors. 
f. In the model estimated from part b, at what point does the effect of age on 
expected cigarette consumption become negative? 

g. Do you think a two-part, or double-hurdle, model for count variables is a better 
way to model cigs? 


18.4. Show that under the conditional moment restriction $E(y \mid \mathbf{x}) = m(\mathbf{x}, \boldsymbol{\beta}_o)$ the
Poisson QMLE achieves the efficiency bound in equation (14.60) when the GLM
variance assumption holds.


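18.5. Consider an unobserved effects model for count data with exponential
regression function

$$E(y_{it} \mid \mathbf{x}_{i1}, \ldots, \mathbf{x}_{iT}, c_i) = c_i \exp(\mathbf{x}_{it}\boldsymbol{\beta}).$$

a. If $E(c_i \mid \mathbf{x}_{i1}, \ldots, \mathbf{x}_{iT}) = \exp(\alpha + \bar{\mathbf{x}}_i\boldsymbol{\gamma})$, find $E(y_{it} \mid \mathbf{x}_{i1}, \ldots, \mathbf{x}_{iT})$.

b. Use part a to derive a test of mean independence between $c_i$ and $\bar{\mathbf{x}}_i$. Assume under
$H_0$ that $\text{Var}(y_{it} \mid \mathbf{x}_i, c_i) = E(y_{it} \mid \mathbf{x}_i, c_i)$, that $y_{it}$ and $y_{ir}$ are uncorrelated conditional on
$(\mathbf{x}_i, c_i)$, and that $c_i$ and $\mathbf{x}_i$ are independent. (Hint: You should devise a test in the
context of multivariate weighted nonlinear least squares.)

c. Suppose now that assumptions (18.83) and (18.84) hold, with $m(\mathbf{x}_{it}, \boldsymbol{\beta}) =
\exp(\mathbf{x}_{it}\boldsymbol{\beta})$, but assumption (18.85) is replaced by $c_i = a_i \exp(\alpha + \bar{\mathbf{x}}_i\boldsymbol{\gamma})$, where $a_i \mid \mathbf{x}_i \sim
\text{Gamma}(\delta, \delta)$. Now how would you estimate $\boldsymbol{\beta}$, $\alpha$, and $\boldsymbol{\gamma}$, and how would you test
$H_0: \boldsymbol{\gamma} = \mathbf{0}$?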


18.6. A model with an additive unobserved effect, strictly exogenous regressors, and
a nonlinear regression function is

$$E(y_{it} \mid \mathbf{x}_i, c_i) = c_i + m(\mathbf{x}_{it}, \boldsymbol{\beta}_o), \qquad t = 1, \ldots, T.$$

a. For each $i$ and $t$ define the time-demeaned variables $\ddot{y}_{it} \equiv y_{it} - \bar{y}_i$ and, for each
$\boldsymbol{\beta}$, $\ddot{m}_{it}(\boldsymbol{\beta}) \equiv m(\mathbf{x}_{it}, \boldsymbol{\beta}) - (1/T)\sum_{r=1}^{T} m(\mathbf{x}_{ir}, \boldsymbol{\beta})$. Argue that, under standard regularity
conditions, the pooled NLS estimator of $\boldsymbol{\beta}_o$ that solves

$$\min_{\boldsymbol{\beta}} \sum_{i=1}^{N} \sum_{t=1}^{T} [\ddot{y}_{it} - \ddot{m}_{it}(\boldsymbol{\beta})]^2 \tag{18.104}$$

is generally consistent and $\sqrt{N}$-asymptotically normal (with $T$ fixed). (Hint: Show
that $E(\ddot{y}_{it} \mid \mathbf{x}_i) = \ddot{m}_{it}(\boldsymbol{\beta}_o)$ for all $t$.)

b. If $\text{Var}(\mathbf{y}_i \mid \mathbf{x}_i, c_i) = \sigma_o^2 \mathbf{I}_T$, how would you estimate the asymptotic variance of the
pooled NLS estimator?

c. If the variance assumption in part b does not hold, how would you estimate the
asymptotic variance?

d. Show that the NLS estimator based on time demeaning from part a is in fact
identical to the pooled NLS estimator that estimates $\{c_1, c_2, \ldots, c_N\}$ along with $\boldsymbol{\beta}_o$:

$$\min_{\{c_1, c_2, \ldots, c_N, \boldsymbol{\beta}\}} \sum_{i=1}^{N} \sum_{t=1}^{T} [y_{it} - c_i - m(\mathbf{x}_{it}, \boldsymbol{\beta})]^2. \tag{18.105}$$

Thus, this is another case where treating the unobserved effects as parameters to
estimate does not result in an inconsistent estimator of $\boldsymbol{\beta}_o$. (Hint: It is easiest to
concentrate out the $c_i$ from the sum of squared residuals; see Section 12.7.4. In the current
context, for given $\boldsymbol{\beta}$, find $\hat{c}_i$ as functions of $\mathbf{y}_i$, $\mathbf{x}_i$, and $\boldsymbol{\beta}$. Then plug these back into
equation (18.105) and show that the concentrated sum of squared residuals function
is identical to equation (18.104).)


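18.7. Assume that the standard FEP assumptions hold, so that, conditional on
$(\mathbf{x}_i, c_i)$, $y_{i1}, \ldots, y_{iT}$ are independent Poisson random variables with means $c_i m(\mathbf{x}_{it}, \boldsymbol{\beta}_o)$.

a. Show that, if we treat the $c_i$ as parameters to estimate along with $\boldsymbol{\beta}_o$, then the
conditional log likelihood for observation $i$ (apart from terms not depending on $c_i$ or $\boldsymbol{\beta}$) is

$$\ell_i(c_i, \boldsymbol{\beta}) \equiv \log[f(y_{i1}, \ldots, y_{iT} \mid \mathbf{x}_i, c_i, \boldsymbol{\beta})]$$

$$= \sum_{t=1}^{T} \{-c_i m(\mathbf{x}_{it}, \boldsymbol{\beta}) + y_{it}[\log(c_i) + \log(m(\mathbf{x}_{it}, \boldsymbol{\beta}))]\},$$

where we now group $c_i$ with $\boldsymbol{\beta}$ as a parameter to estimate. (Note that $c_i > 0$ is a
needed restriction.)

b. Let $n_i = y_{i1} + \cdots + y_{iT}$, and assume that $n_i > 0$. For given $\boldsymbol{\beta}$, maximize $\ell_i(c_i, \boldsymbol{\beta})$
only with respect to $c_i$. Find the solution, $c_i(\boldsymbol{\beta}) > 0$.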


c. Plug the solution from part b into $\ell_i[c_i(\boldsymbol{\beta}), \boldsymbol{\beta}]$, and show that

$$\ell_i[c_i(\boldsymbol{\beta}), \boldsymbol{\beta}] = \sum_{t=1}^{T} y_{it} \log[p_t(\mathbf{x}_i, \boldsymbol{\beta})] + n_i[\log(n_i) - 1].$$

d. Conclude from part c that the log-likelihood function for all $N$ cross section
observations, with $(c_1, \ldots, c_N)$ concentrated out, is

$$\sum_{i=1}^{N} \sum_{t=1}^{T} y_{it} \log[p_t(\mathbf{x}_i, \boldsymbol{\beta})] + \sum_{i=1}^{N} n_i[\log(n_i) - 1].$$

What does this mean about the conditional MLE from Section 18.7.4 and the
estimator that treats the $c_i$ as parameters to estimate along with $\boldsymbol{\beta}_o$?


18.8. Let $y$ be a fractional response, so that $0 \leq y \leq 1$.

a. Suppose that $0 < y < 1$, so that $w = \log[y/(1 - y)]$ is well defined. If we assume
the linear model $E(w \mid \mathbf{x}) = \mathbf{x}\boldsymbol{\alpha}$, does $E(y \mid \mathbf{x})$ have any simple relationship to $\mathbf{x}\boldsymbol{\alpha}$?
What would we need to know to obtain $E(y \mid \mathbf{x})$? Let $\hat{\boldsymbol{\alpha}}$ be the OLS estimator from
the regression $w_i$ on $\mathbf{x}_i$, $i = 1, \ldots, N$.

b. If we estimate the fractional logit model for $E(y \mid \mathbf{x})$ from Section 18.6.1, should
we expect the estimated parameters, $\hat{\boldsymbol{\beta}}$, to be close to $\hat{\boldsymbol{\alpha}}$ from part a? Explain.

c. Now suppose that $y$ takes on the values zero and one with positive probability. To
model this population feature we use a latent variable model:

$$y^* \mid \mathbf{x} \sim \text{Normal}(\mathbf{x}\boldsymbol{\gamma}, \sigma^2)$$

$$y = 0 \ \text{ if } y^* \leq 0; \qquad y = y^* \ \text{ if } 0 < y^* < 1; \qquad y = 1 \ \text{ if } y^* \geq 1.$$

How should we estimate $\boldsymbol{\gamma}$ and $\sigma^2$?

d. Given the estimate $\hat{\boldsymbol{\gamma}}$ from part c, does it make sense to compare the magnitude of
$\hat{\gamma}_j$ to the corresponding $\hat{\alpha}_j$ from part a or the $\hat{\beta}_j$ from part b? Explain.

e. How might we choose between the models estimated in parts b and c? (Hint:
Think about goodness of fit for the conditional mean.)

f. Now suppose that $0 \leq y < 1$. Suppose we apply fractional logit, as in part b, and
fractional logit to the subsample with $0 < y_i < 1$. Should we necessarily get similar
answers?

g. With $0 \leq y < 1$ suppose that $E(y_i \mid \mathbf{x}_i, y_i > 0) = \exp(\mathbf{x}_i\boldsymbol{\delta})/[1 + \exp(\mathbf{x}_i\boldsymbol{\delta})]$. How
should we estimate $\boldsymbol{\delta}$ in this case?

h. To the assumptions from part g add $P(y_i = 0 \mid \mathbf{x}_i) = 1 - G(\mathbf{x}_i\boldsymbol{\alpha})$, where $G(\cdot)$ is a
differentiable, strictly increasing cumulative distribution function. How should we
estimate $E(y_i \mid \mathbf{x}_i)$?


18.9. Use the data in ATTEND.RAW to answer this question. 


a. Estimate a linear regression relating atndrte to ACT, priGPA, frosh, and soph; 
compute the usual OLS standard errors. Interpret the coefficients on ACT and 
priGPA. Are any fitted values outside the unit interval? 
b. Model $E(atndrte \mid \mathbf{x})$ as a logistic function, as in Section 18.6.1. Use the QMLE for
the Bernoulli log likelihood, and compute the GLM standard errors. What is $\hat{\sigma}$, and
how does it affect the standard errors?


c. For priGPA = 3.0 and frosh = soph = 0, estimate the effect of increasing ACT 
from 25 to 30 using the estimated equation from part b. How does the estimate 
compare with that from the linear model? 


d. Does a linear model or a logistic model provide a better fit to E(atndrte | x)? 


18.10. Use the data in PATENT.RAW for this exercise. 


a. Estimate a pooled Poisson regression model relating patents to lsales = log(sales)
and current and four lags of lrnd = log(1 + rnd), where we add one before taking the
log to account for the fact that rnd is zero for some firms in some years. Use an
exponential mean function and include a full set of year dummies. Which lags of lrnd
are significant using the usual Poisson MLE standard errors?


b. Give two reasons why the usual Poisson MLE standard errors from part a might 
be invalid. 


c. Obtain $\hat{\sigma}$ for the pooled Poisson estimation. Using the GLM standard errors
(but without an adjustment for possible serial dependence), which lags of lrnd are
significant?

d. Obtain the QLR statistic for joint significance of lags one through four of lrnd. (Be
careful here; you must use the same set of years in estimating the restricted version of
the model.) How does it compare to the usual LR statistic?


e. Compute the standard errors that are robust to an arbitrary conditional variance 
and serial dependence. How do they compare with the standard errors from parts a 
and c? 


f. What is the estimated long-run elasticity of expected patents with respect to R&D 
spending? (Ignore the fact that one has been added to the R&D numbers before 
taking the log.) Obtain a fully robust standard error for the long-run elasticity. 


g. Now use the FEP estimator, and compare the estimated lag coefficients to those 
from the pooled Poisson analysis. Estimate the long-run elasticity, and obtain the 
usual FEP and fully robust standard errors. 


18.11. a. For a random draw $i$ from the cross section, assume that (1) for each time
period $t$, $y_{it} \mid \mathbf{x}_i, c_i \sim \text{Poisson}(c_i m_{it})$, where $c_i > 0$ is unobserved heterogeneity and
$m_{it} > 0$ is typically a function of only $\mathbf{x}_{it}$; and (2) $(y_{i1}, \ldots, y_{iT})$ are independent
conditional on $(\mathbf{x}_i, c_i)$. Derive the density of $(y_{i1}, \ldots, y_{iT})$ conditional on $(\mathbf{x}_i, c_i)$.

b. To the assumptions from part a, add the assumption that $c_i \mid \mathbf{x}_i \sim \text{Gamma}(\delta, \delta)$, so
that $E(c_i) = 1$ and $\text{Var}(c_i) = 1/\delta$. (The density of $c_i$ is $h(c) = [\delta^{\delta}/\Gamma(\delta)] c^{\delta - 1} \exp(-\delta c)$,
where $\Gamma(\delta)$ is the gamma function.) Let $s = y_{i1} + \cdots + y_{iT}$ and $M_i = m_{i1} + \cdots + m_{iT}$.
Show that the density of $(y_{i1}, \ldots, y_{iT})$ given $\mathbf{x}_i$ is

$$\left( \prod_{t=1}^{T} m_{it}^{y_{it}} / y_{it}! \right) [\delta^{\delta}/\Gamma(\delta)] [\Gamma(s + \delta)/(M_i + \delta)^{(s + \delta)}].$$

(Hint: The easiest way to show this result is to turn the integral into one involving a
Gamma$(s + \delta, M_i + \delta)$ density and a multiplicative term. Naturally, the density must
integrate to unity, and so what is left over is the density we seek.)


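18.12. For a random draw $i$ from the cross section, assume that (1) for each $t$,
$y_{it} \mid \mathbf{x}_i, c_i \sim \text{Gamma}(m_{it}, 1/c_i)$, where $c_i > 0$ is unobserved heterogeneity and $m_{it} > 0$;
and (2) $(y_{i1}, \ldots, y_{iT})$ are independent conditional on $(\mathbf{x}_i, c_i)$. The gamma distribution
is parameterized so that $E(y_{it} \mid \mathbf{x}_i, c_i) = c_i m_{it}$ and $\text{Var}(y_{it} \mid \mathbf{x}_i, c_i) = c_i^2 m_{it}$.

a. Let $s_i = y_{i1} + \cdots + y_{iT}$. Show that the density of $(y_{i1}, y_{i2}, \ldots, y_{iT})$ conditional on
$(s_i, \mathbf{x}_i, c_i)$ is

$$f(y_{i1}, \ldots, y_{iT} \mid s_i, \mathbf{x}_i, c_i) = \left[ \Gamma(m_{i1} + \cdots + m_{iT}) \Big/ \prod_{t=1}^{T} \Gamma(m_{it}) \right] \times \left( \prod_{t=1}^{T} y_{it}^{m_{it} - 1} \right) \Big/ s_i^{\{(m_{i1} + \cdots + m_{iT}) - 1\}},$$

where $\Gamma(\cdot)$ is the gamma function. Note that the density does not depend on $c_i$. {Hint:
If $Y_1, \ldots, Y_T$ are independent random variables and $S = Y_1 + \cdots + Y_T$, the joint
density of $Y_1, \ldots, Y_T$ given $S = s$ is $f_1(y_1) \cdots f_{T-1}(y_{T-1}) f_T(s - y_1 - \cdots - y_{T-1})/
g(s)$, where $g(s)$ is the density of $S$. When $Y_t$ has a Gamma$(\alpha_t, \lambda)$ distribution for each
$t$, so that $f_t(y_t) = [\lambda^{\alpha_t}/\Gamma(\alpha_t)] y_t^{(\alpha_t - 1)} \exp(-\lambda y_t)$, $S \sim \text{Gamma}(\alpha_1 + \cdots + \alpha_T, \lambda)$.}

b. Let $m_t(\mathbf{x}_i, \boldsymbol{\beta})$ be a parametric function for $m_{it}$, for example, $\exp(\mathbf{x}_{it}\boldsymbol{\beta})$. Write
down the log-likelihood function for observation $i$. The conditional MLE in this case
is called the fixed effects gamma estimator.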


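18.13. Let $y$ be a continuous fractional response variable, that is, $y$ can take on any
value in $(0, 1)$. For a $1 \times K$ vector of conditioning variables $\mathbf{x}$, where $x_1 = 1$, suppose
that the conditional density of $y_i$ given $\mathbf{x}_i = \mathbf{x}$ is

$$f(y \mid \mathbf{x}; \boldsymbol{\beta}_o) = \exp(\mathbf{x}\boldsymbol{\beta}_o)\, y^{[\exp(\mathbf{x}\boldsymbol{\beta}_o) - 1]}, \qquad 0 < y < 1.$$

It can be shown that, for this density, $E(y_i \mid \mathbf{x}_i) = \exp(\mathbf{x}_i\boldsymbol{\beta}_o)/[1 + \exp(\mathbf{x}_i\boldsymbol{\beta}_o)]$; that is,
it is the logistic function evaluated at $\mathbf{x}_i\boldsymbol{\beta}_o$.

a. For a random draw $i$ from the population, write down the log-likelihood function
as a function of $\boldsymbol{\beta}$.

b. Find the $K \times 1$ score vector, $\mathbf{s}_i(\boldsymbol{\beta})$.

c. For this density, it can be shown that $E[\log(y_i) \mid \mathbf{x}_i] = -1/\exp(\mathbf{x}_i\boldsymbol{\beta}_o)$. Why does it
make sense for the conditional expectation to be negative?

d. Verify that $E[\mathbf{s}_i(\boldsymbol{\beta}_o) \mid \mathbf{x}_i] = \mathbf{0}$.

e. Find $-E[\mathbf{H}_i(\boldsymbol{\beta}_o) \mid \mathbf{x}_i]$.

f. How would you estimate $\text{Avar}\,\sqrt{N}(\hat{\boldsymbol{\beta}} - \boldsymbol{\beta}_o)$? Be very specific.

g. Is the MLE consistent for $\boldsymbol{\beta}_o$ if we only assume the conditional mean is correctly
specified, that is, $E(y_i \mid \mathbf{x}_i) = \exp(\mathbf{x}_i\boldsymbol{\beta}_o)/[1 + \exp(\mathbf{x}_i\boldsymbol{\beta}_o)]$? (Hint: Look at $E[\mathbf{s}_i(\boldsymbol{\beta}_o) \mid \mathbf{x}_i]$.)

h. If you are only interested in $E(y_i \mid \mathbf{x}_i)$, what might you do instead of MLE?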


18.14. Use data in 401KPART.RAW for this question; it is similar to the data set 
used in Papke and Wooldridge (1996), except it includes the number of employees 
eligible to participate in the 401(k) pension plan. 


a. Let $y_i = partic_i$ and $n_i = employ_i$, and use binomial QMLE to estimate
$E(y_i \mid n_i, \mathbf{x}_i) = n_i\Lambda(\mathbf{x}_i\boldsymbol{\beta})$, where $\Lambda(\cdot)$ is the logistic function and $\mathbf{x}$ includes a constant,
mrate, ltotemp, age, agesq, and sole. Obtain three sets of standard errors: those based
on (18.34) with $\sigma^2 = 1$ (which holds if the distribution is actually binomial), those
from (18.34) with $\sigma^2$ estimated, and the fully robust standard errors. Comment on
how they compare.

b. Now use prate in a fractional logit analysis using the same vector $\mathbf{x}$ in part a.
Again, compute three sets of standard errors and discuss how they compare.

c. Explain why it makes sense to compare the coefficient estimates from parts a and
b. Are there important differences in the coefficients, particularly for the key variable
mrate? Which approach produces the more precise estimate?


d. Compute the APE for mrate on E(prate|x) using the estimates from parts a and 
b. 


e. Using fractional logit, estimate the APE on E( prate |x) when mrate goes from .25 
to .50. 

f. Add mrate² to the fractional logit estimation. Is there a strong case for including
it? Explain.
18.15. For $0 < y_{it} < 1$ consider the model

$$\log[y_{it}/(1 - y_{it})] = \mathbf{x}_{it}\boldsymbol{\beta} + c_i + u_{it}$$

$$E(u_{it} \mid \mathbf{x}_i, c_i) = 0, \qquad t = 1, \ldots, T.$$

a. Assuming the $\{\mathbf{x}_{it} : t = 1, \ldots, T\}$ are time varying, how would you estimate $\boldsymbol{\beta}$?

b. Let $v_{it} = c_i + u_{it}$ and write $y_{it}$ in terms of $\mathbf{x}_{it}\boldsymbol{\beta} + v_{it}$. Find the average structural
function for $y_t$ (as a function of $\mathbf{x}_t$) as an expectation over the distribution of $v_{it}$.

c. Without further assumptions, can you consistently estimate the ASF? Explain.

d. Assume that $c_i = \psi + \bar{\mathbf{x}}_i\boldsymbol{\xi} + a_i$, where $r_{it} \equiv a_i + u_{it}$ is independent of $\mathbf{x}_i$. Explain
how to consistently estimate the ASF in this case.


19 Censored Data, Sample Selection, and Attrition


19.1 Introduction 


In previous chapters we assumed that we can obtain a random sample from the 
population of interest. For example, in Part I, where we studied models linear in the 
parameters, we assumed that data on the dependent variable, the explanatory vari- 
ables, and instrumental variables can be obtained by means of random sampling— 
whether in a cross section or panel data context. In earlier chapters of Part IV we 
studied various nonlinear models for response variables that are limited in some way. 
Chapter 15 extensively considered binary response models, and we saw that the most 
commonly used models imply nonconstant partial effects. The same is true for corner 
solution responses in Chapter 17 and those for count and fractional responses in 
Chapter 18. It is critical to understand that our reason for looking beyond linear 
models in those chapters is to obtain functional forms that are more realistic than 
models that are linear in parameters. 

In this chapter we turn to several missing data problems. It is critical—even more 
so than in the previous chapters—to distinguish between assumptions placed on the 
population model and assumptions about how the data were generated. Under ran- 
dom sampling, the interesting issues concern the population model and assumptions 
we make about distributional features in the population. With nonrandom sampling, 
we must take particular care in stating assumptions about the population and sepa- 
rately stating assumptions on the sampling scheme. 

We first study the general problem of data censoring. With data censoring we can 
still randomly sample units from the relevant population, but we face the problem 
that one or more of the variables is censored: we only observe the response over a 
certain range—sometimes a very limited range and sometimes a broad range. We 
cover several examples of data censoring in Section 19.2. 

In addition to data censoring, we also treat the general problem of sample selec- 
tion. Section 19.3 begins with a general discussion of examples of missing data 
schemes, and Section 19.4 establishes when the missing data problem can be ignored 
without resulting in inconsistent estimators. With sample selection problems, we may 
or may not be able to randomly sample units from the population. The case of data 
truncation, which we study in Section 19.5, is a situation where we do not randomly 
draw units from the population; rather, we randomly sample from a subpopulation 
defined in terms of one or more of the observed variables. Naturally, the population 
parameters cannot always be identified with such sampling schemes, but they can be 
under suitable assumptions. 

Another sample selection problem is incidental truncation, where certain variables 
are observed only if other variables take on particular values. In such cases, we often
can randomly sample units, but we have data missing on key variables (and, unlike in
the data censoring case, we have no information on the outcome, or even a range of 
possible outcomes, of the missing data). We treat this case in Sections 19.6 and 19.7. 

Most of the methods that allow for sample selection to be systematically correlated 
with unobservables rely on linearity or some other simple response function (such as 
exponential). An alternative approach is inverse probability weighting, which can be 
applied to general missing data problems if we have good enough predictors of 
selection so that, conditional on those predictors, selection is appropriately exogenous. 
We cover inverse probability weighting in the context of M-estimation in Section 
19.8. Section 19.9 covers sample selection, including the specific problem of attrition, 
in panel data applications. 

Before we treat specific censoring and selection schemes, it is important to under- 
stand the notational conventions in this chapter. As in previous chapters, we continue 
to let y denote the response variable in the population of interest. Therefore, we will 
write population models with y as the dependent variable. The distinction between 
the underlying response variable of interest, and what we can observe, is critical 
throughout this chapter. For example, in Section 19.2 we consider the case where we 
can randomly sample from the population but we only observe a censored version of 
y. If w denotes the censored outcome, then for a randomly drawn unit i we observe
\(w_i\), but this may not equal \(y_i\).

For models with endogenous explanatory variables, we continue to use the
conventions of the previous chapters, letting, for example, \(y_1\) and \(y_2\) denote the
endogenous variables in the underlying population (where \(y_1\) is typically the response
variable and \(y_2\) is an endogenous explanatory variable). In principle, each of these
could be censored, and we use \(w_1\) and \(w_2\), respectively, to denote the censored
outcomes.


19.2 Data Censoring 


Traditionally, data censoring problems are treated within the frameworks of Chap- 
ters 15—17—alongside binary responses, multinomial responses, and corner solution 
responses. The benefit of a parallel treatment is that it economizes on the presenta- 
tion, but it has significant costs: empirical researchers tend to let the two very differ- 
ent issues of functional form for limited dependent variables and data censoring 
problems blend together. Thus, although the statistical models used for limited de- 
pendent variables and censored dependent variables are similar, their interpretation, 
as well as the way one views and reacts to violations of standard assumptions, is very 
different. Because the statistical tools for handling censored data are very similar to
those used for specifying and estimating models for limited dependent variables, our coverage of the
estimation details can be terse. 

As an example, consider the problem of top coding, where a variable is reported 
only up to a specified ceiling. For outcomes above the ceiling, all we know is that the 
outcome was above the ceiling. Common examples are survey data on wealth and 
income. In order to elicit responses from wealthy people, some surveys only ask 
about the amount of wealth up to a given threshold, allowing wealthy people to 
simply indicate if their wealth is above the threshold. Probably the underlying popu- 
lation model in this case is a standard linear model, but that is a separate issue. To 
emphasize this point, we might also use top coding when collecting data on charitable 
contributions. As we saw in Chapter 17, in a large population charitable
contributions are best described as a corner solution outcome, with a corner at zero. A
sensible population model for charitable contributions is the Tobit model, or perhaps
one of the two-part models we covered in Section 17.6. Specification of this popula- 
tion model is separate from the data collection scheme. If we top code contributions 
at, say, $10,000 (where contributions are measured in $1,000s), then the reported 
contributions data will appear to have two corners: one at zero and one at 10. Of 
course, these are very different in nature: when we observe a zero, we know that the 
person had zero charitable contributions. If we observe 10, we only know that the 
contributions were at least $10,000. If the top coding were instead at $20,000, noth- 
ing would happen to the population distribution of contributions; it still has a corner 
at zero. But the top coding changes the upper corner, and by increasing the censoring 
value we observe more of the population distribution of contributions. As we will see 
in Section 19.2.3, the statistical model for estimating the parameters when a Type I 
Tobit variable is top coded is equivalent to a two-limit Tobit; but the underlying 
model is the standard Type I Tobit model. 

Although we will touch on more complicated situations, such as the Tobit model 
just described, our main focus in this chapter is on a very familiar model: the linear
model from introductory econometrics. Let y denote the variable of interest, and 
assume that it follows a standard linear model in the population: 


\[ y = x\beta + u \tag{19.1} \]
\[ \mathrm{E}(u \mid x) = 0, \tag{19.2} \]


where x is \(1 \times K\) with first element unity. Under assumptions (19.1) and (19.2), if we
have random draws \((x_i, y_i)\) from the population, then OLS is consistent and
\(\sqrt{N}\)-asymptotically normal for the parameters of interest, \(\beta\).

The problem we study in this chapter occurs when we observe only a censored 
version of y. In the top coding example, suppose that wealth is measured in
thousands of dollars and is top coded at $200,000. Then we can define the censored
version of wealth (for any unit that can be drawn from the population) as
\(w = \min(y, 200)\). Of course, for each random draw i, \(w_i = \min(y_i, 200)\). If we are
presented with the data set on \(x_i\) and \(w_i\), where \(w_i\) is called “wealth,” we should notice
that the maximum value of “wealth” in the sample is 200, with a nontrivial fraction
of observations at exactly 200. Because there is no behavioral reason to see a focal
point for wealth at 200—let alone, to observe no values greater than 200—we would 
recognize that the wealth variable has been top coded at 200. 
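
The top-coding example can be illustrated with a small simulation. The following is only a sketch under assumed values: the single covariate, the coefficients (150 and 40), and the error standard deviation (60) are hypothetical and are not taken from any data set in the text. It draws latent wealth from the linear model (19.1) and (19.2), top codes it at 200, and compares OLS on the latent outcome with OLS on the top-coded outcome.

```python
import numpy as np

rng = np.random.default_rng(0)
N = 5_000

# Hypothetical population: one standardized covariate plus an intercept.
x1 = rng.normal(0.0, 1.0, size=N)
u = rng.normal(0.0, 60.0, size=N)
y = 150.0 + 40.0 * x1 + u            # latent wealth, in $1,000s (assumed values)

# Top coding at $200,000: we only record w = min(y, 200).
w = np.minimum(y, 200.0)

X = np.column_stack([np.ones(N), x1])
b_latent = np.linalg.lstsq(X, y, rcond=None)[0]     # infeasible OLS on y
b_topcoded = np.linalg.lstsq(X, w, rcond=None)[0]   # OLS that ignores the censoring

share_at_cap = np.mean(w == 200.0)   # pileup at the ceiling
print("share of observations at 200:", round(share_at_cap, 3))
print("OLS slope using y:", round(b_latent[1], 2))
print("OLS slope using w:", round(b_topcoded[1], 2))
```

In this design the sample maximum of w is exactly 200, a nontrivial share of observations sits at that value, and the OLS slope from the top-coded outcome is pulled toward zero relative to the slope from the latent outcome.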


19.2.1 Binary Censoring 


We first cover the case where the censoring of the underlying response variable is 
extreme. As an example, suppose we want to model willingness to pay (WTP) for a 
proposed public project. Assume that the underlying model is as in equations (19.1) 
and (19.2) with y = wtp. When we draw family (say) i from the population, we would
like to observe \((x_i, wtp_i)\); if we did so for all i, we would estimate \(\beta\) by OLS. But
willingness to pay can be difficult to elicit, and reported amounts might be noisy.
Instead, suppose that each family is presented with a cost of the project, \(r_i\). Presented
with this cost, the household either says it is in favor of the project or not. Thus,
along with \(x_i\) and \(r_i\), we observe the binary response


\[ w_i = 1[y_i > r_i], \tag{19.3} \]


where we assume, for now, that the chance that \(y_i\) equals \(r_i\) is zero.

What is the most natural way to proceed to estimate \(\beta\)? If we impose some strong
assumptions on the underlying population and the nature of \(r_i\), then we can proceed
with maximum likelihood. In particular, assume


\[ u_i \mid x_i, r_i \sim \mathrm{Normal}(0, \sigma^2). \tag{19.4} \]


Assumption (19.4) implies that \(y_i = x_i\beta + u_i\) actually satisfies the classical linear
model (CLM). It also requires that \(r_i\) is independent of \(y_i\) conditional on \(x_i\), that is,


\[ D(y_i \mid x_i, r_i) = D(y_i \mid x_i). \tag{19.5} \]


Assumption (19.5) is satisfied if \(r_i\) is randomized or if \(r_i\) is chosen as a function of \(x_i\),
or some combination of these alternatives.
Given assumption (19.4),


\[
\begin{aligned}
P(w_i = 1 \mid x_i, r_i) &= P(y_i > r_i \mid x_i, r_i) = P[u_i/\sigma > (r_i - x_i\beta)/\sigma \mid x_i, r_i] \\
&= 1 - \Phi[(r_i - x_i\beta)/\sigma] = \Phi[(x_i\beta - r_i)/\sigma].
\end{aligned}
\tag{19.6}
\]

We can see that the binary response, which indicates whether unit i is in favor of the
project at cost \(r_i\), follows a probit model with parameter vector \(\beta/\sigma\) on \(x_i\) and \(-1/\sigma\)
on \(r_i\). (In almost all applications \(x_i\) would include an intercept, and we allow that
possibility here.) Therefore, all the parameters, including \(\sigma\), are identified provided
\(x_i\) is not perfectly collinear and \(r_i\) varies across i in a way not perfectly linearly related
to \(x_i\). Given the data censoring, the maximum likelihood estimators (MLEs) are the
asymptotically efficient estimators of \(\beta\) and \(\sigma\) (or \(\sigma^2\)). Because the underlying
population model is linear, we are interested in the slopes, \(\beta_j\). In the next section we show
that the binary censoring problem is a special case of interval censoring with
unit-specific thresholds.
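
The probit structure in equation (19.6) suggests a direct way to recover \(\beta\) and \(\sigma\) from the binary responses. The sketch below is illustrative only: the data are simulated, the values of \(\beta\) and \(\sigma\) and the uniform design for the costs \(r_i\) are assumptions, and the function name `probit_negloglik` is hypothetical. It fits a probit of \(w_i\) on \((x_i, -r_i)\), so that the coefficient on \(-r_i\) estimates \(1/\sigma\) and the coefficients on \(x_i\) estimate \(\beta/\sigma\), and then rescales.

```python
import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm


def probit_negloglik(theta, Z, w):
    """Negative probit log likelihood and gradient for P(w = 1 | Z) = Phi(Z @ theta)."""
    q = 2.0 * w - 1.0                                   # +1 if w = 1, -1 if w = 0
    idx = q * (Z @ theta)
    nll = -np.sum(norm.logcdf(idx))
    lam = np.exp(norm.logpdf(idx) - norm.logcdf(idx))   # phi/Phi evaluated at idx
    grad = -Z.T @ (q * lam)
    return nll, grad


rng = np.random.default_rng(1)
N = 4000
x1 = rng.normal(size=N)
X = np.column_stack([np.ones(N), x1])
beta_true, sigma_true = np.array([10.0, 2.0]), 3.0      # assumed population values
y = X @ beta_true + sigma_true * rng.normal(size=N)     # latent WTP, not used in estimation
r = rng.uniform(2.0, 18.0, size=N)                      # randomized costs, as in (19.5)
w = (y > r).astype(float)                               # observed response, equation (19.3)

# Probit index from (19.6): x_i (beta/sigma) + (-r_i)(1/sigma).
Z = np.column_stack([X, -r])
res = minimize(probit_negloglik, np.zeros(Z.shape[1]), args=(Z, w),
               jac=True, method="BFGS")

tau_hat = res.x[-1]               # estimate of 1/sigma
sigma_hat = 1.0 / tau_hat
beta_hat = res.x[:-1] / tau_hat   # undo the scaling to recover beta
print("beta_hat:", np.round(beta_hat, 2), "sigma_hat:", round(sigma_hat, 2))
```

Working in the \((\beta/\sigma, 1/\sigma)\) parameterization keeps the problem a standard probit; the variation in \(r_i\) is what pins down the scale, which is why \(\sigma\) is identified here but not in an ordinary probit.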

The costs of the binary censoring scheme are potentially severe. If we could observe
\(y_i\), specifying \(\mathrm{E}(y_i \mid x_i) = x_i\beta\) would suffice for consistent estimation of \(\beta\); in
fact, we could just specify a linear projection and use OLS. With censoring, we must
add more, and assumption (19.4) implies that the underlying model satisfies the 
CLM. It is in this setting where discussions of the deleterious effects of nonnormality 
and heteroskedasticity when using probit models make sense. In Chapter 15 we fo- 
cused on the case where the binary response is the variable we want to explain, in 
which case we are interested in estimating the partial effects on the response proba- 
bility. In that setting, heteroskedasticity and nonnormality in the error of the latent 
variable model change the functional form of the quantity of interest, and so the 
relevant issue concerns how those two problems affect the estimated partial effects 
(and the implications they have for a standard probit model, or even a linear proba- 
bility model, for estimating the partial effects). With data censoring, we are interested 
only in the parameters in the underlying linear model. Consequently, it is now legit- 
imate to be concerned about the effects on the parameter estimates of hetero- 
skedasticity or nonnormality in the underlying linear model. 

In Section 15.7.6 we discussed various ways of estimating parameters up to scale 
without placing strong restrictions on \(D(u_i \mid x_i, r_i)\). Those estimators can be used in
the present context. For example, if the distribution of \((x_i, r_i, y_i)\) implies linear
conditional expectations for all elements conditional on \(y_i\), then the Chung and
Goldberger (1984) results can be used for OLS. Ruud’s (1983, 1986) findings can be
applied when \(u_i\) is independent of \((x_i, r_i)\) if the distribution of \(u_i\) is misspecified.
Manski’s (1975, 1988) maximum score estimator requires only symmetry of
\(D(u_i \mid x_i, r_i)\) (around zero), and Horowitz’s (1992) smoothed version is more
convenient for inference. But in every case these methods only estimate the slope
coefficients up to a common scale factor (and the intercept cannot be estimated at all);
therefore we cannot learn the magnitude of the effect of any element of x on will- 
ingness to pay, nor do we have a way of predicting willingness to pay for given values 
of the covariates.

A linear model for willingness to pay is not ideal because \(wtp \ge 0\). If wtp is zero for
some subset of the population, a sensible population model is the Type I Tobit:


\[ y = wtp = \max(0, x\beta + u), \tag{19.7} \]


under assumption (19.4). If we had a random sample, we would use Type I Tobit
MLE to estimate \(\beta\) and \(\sigma^2\) (naturally, \(r_i\) would not come into play), and we would use
the MLEs to estimate the means, say, from a Type I Tobit. Interestingly, if we have
binary censoring, the estimation procedure is identical to that outlined for a linear
model for wtp, provided that the \(r_i\) are all strictly positive. Because we do not observe
\(y_i\), we cannot distinguish between equations (19.1) and (19.7) when \(r_i > 0\). But if we
believe that y is zero for a nontrivial fraction of the population, any calculations 
should reflect that belief by using the Type I Tobit formulas for estimating partial 
effects. 

One way to possibly determine whether WTP is ever zero in the population is to set
some \(r_i\) to zero so that the outcome \(w_i = 0\) means \(wtp_i = 0\). Or, we might set some \(r_i\)
sufficiently “close” to zero so that \(w_i = 0\) practically means \(wtp_i = 0\). Of course, this
approach requires a particular survey design before the data have been collected. 

If we really think y = wtp > 0 in the population—say, everyone in a city would be 
willing to pay at least a small amount for a new park—then the underlying popula- 
tion model should be something like 


\[ y = \exp(x\beta + u), \tag{19.8} \]


under assumption (19.4). Then, \(\log(y) = x\beta + u\), so that the previous analysis applies
but with \(r_i\) replaced with \(\log(r_i)\) (and we are back to assuming \(r_i > 0\)). Interestingly,
the change in the functional form for \(P(w_i = 1 \mid x_i, r_i)\) from equation (19.6) to


\[ P(w_i = 1 \mid x_i, r_i) = \Phi[(x_i\beta - \log(r_i))/\sigma] \tag{19.9} \]


provides a way to distinguish between assumptions (19.7) and (19.8) as the underlying
population models when \(r_i > 0\) for all i.

Naturally, the previous models apply to cases other than willingness to pay. But if 
y cannot take on negative values, the linear model we started with in equation (19.1) 
is not ideal. Assumption (19.7) leads to the same estimation method but implies a 
Tobit form for \(\mathrm{E}(y \mid x)\). If we think \(y > 0\) in the population, then equation (19.8) is
more attractive, and we can use equations (19.6) and (19.9) to distinguish between
them based on values of the log-likelihood functions. In fact, because we know the
coefficient on \(r_i\) in the first case, and \(\log(r_i)\) in the second, must be nonzero, we
can use Vuong’s (1989) model selection test to choose between them; see Section
13.11.2.


19.2.2 Interval Coding


We now consider the linear model (19.1) in a scenario where the continuous, quan- 
titative outcome, y, is only recorded to fall into a particular interval. In this case we 
say we have interval-coded data (or interval-censored data). We are still interested in 
the population regression \(\mathrm{E}(y \mid x) = x\beta\). Let \(r_1 < r_2 < \cdots < r_J\) denote the known
interval limits; these are specified as part of the survey design. For example, rather than
asking individuals to report actual annual income, a survey might ask them to report
the interval that their income falls into.

Under the normality assumption in equation (19.4), we can estimate \(\beta\) and \(\sigma^2\). Not
surprisingly, the structure of the problem is similar to the ordered probit model we 
covered in Section 16.3. In fact, we can define 


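\[
\begin{aligned}
w &= 0 && \text{if } y \le r_1 \\
w &= 1 && \text{if } r_1 < y \le r_2 \\
&\;\;\vdots \\
w &= J && \text{if } y > r_J
\end{aligned}
\tag{19.10}
\]


and easily obtain the conditional probabilities \(P(w = j \mid x)\) for \(j = 0, 1, \ldots, J\). The log
likelihood for a random draw i is


\[
\begin{aligned}
\ell_i(\beta, \sigma) = {} & 1[w_i = 0] \log\{\Phi[(r_1 - x_i\beta)/\sigma]\} + 1[w_i = 1] \log\{\Phi[(r_2 - x_i\beta)/\sigma] - \Phi[(r_1 - x_i\beta)/\sigma]\} \\
& + \cdots + 1[w_i = J] \log\{1 - \Phi[(r_J - x_i\beta)/\sigma]\}.
\end{aligned}
\tag{19.11}
\]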


The maximum likelihood estimators, \(\hat{\beta}\) and \(\hat{\sigma}^2\), are often called interval regression
estimators, with the understanding that the underlying population distribution is
homoskedastic normal.

Although equation (19.11) looks a lot like the log likelihood for the ordered probit
model, there is an important difference: in ordered probit, the cut points are
parameters to estimate, and the parameters \(\beta\) do not measure interesting partial effects.
With interval regression, the interval endpoints are given (or are themselves data),
and \(\beta\) contains the partial effects of interest. In particular, as in the case of binary
censoring, when we obtain the interval regression estimates, we interpret the \(\hat{\beta}\) as if
we had been able to run the regression \(y_i\) on \(x_i\), \(i = 1, \ldots, N\). Imposing the
assumptions of the classical linear model allows us to estimate the parameters in the
distribution \(D(y \mid x)\), even though the data are interval censored.
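
The log likelihood in (19.11) is straightforward to code once each observation is represented by the lower and upper endpoints of its interval, with \(-\infty\) and \(+\infty\) for the two open-ended categories. The following is a minimal sketch on simulated data; the interval limits (20, 40, 60, 80), the population values (50, 10, and \(\sigma = 15\)), and the function name `interval_negloglik` are assumptions made for illustration, and \(\sigma\) is parameterized through its logarithm only to keep it positive during optimization.

```python
import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm


def interval_negloglik(params, X, lower, upper):
    """Negative of the sum of (19.11); params holds beta followed by log(sigma)."""
    beta, sigma = params[:-1], np.exp(params[-1])
    xb = X @ beta
    # P(lower < y <= upper | x) under y | x ~ Normal(x beta, sigma^2)
    prob = norm.cdf((upper - xb) / sigma) - norm.cdf((lower - xb) / sigma)
    return -np.sum(np.log(np.clip(prob, 1e-300, None)))


rng = np.random.default_rng(2)
N = 3000
x1 = rng.normal(size=N)
X = np.column_stack([np.ones(N), x1])
y = 50.0 + 10.0 * x1 + 15.0 * rng.normal(size=N)    # latent outcome (assumed values)

limits = np.array([20.0, 40.0, 60.0, 80.0])         # r_1 < ... < r_J, here J = 4
w = np.searchsorted(limits, y, side="left")         # category 0, ..., J as in (19.10)
edges = np.concatenate(([-np.inf], limits, [np.inf]))
lower, upper = edges[w], edges[w + 1]               # endpoints of each unit's interval

start = np.array([limits.mean(), 0.0, np.log(limits.std())])
res = minimize(interval_negloglik, start, args=(X, lower, upper), method="BFGS")
beta_hat, sigma_hat = res.x[:-1], np.exp(res.x[-1])
print("beta_hat:", np.round(beta_hat, 2), "sigma_hat:", round(sigma_hat, 2))
```

Because the function takes unit-specific `lower` and `upper` arrays, the same code covers interval limits that vary across observations, provided the limits are exogenous in the sense required for the likelihood to be valid.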

Sometimes in applications of interval regression the observed, censored variable, w, 
is set to some value within the interval that contains y. For example, if y is wealth,
we might set w to the midpoint of the interval that y falls into. (Of course, we have to
use some other rule if \(y < r_1\) or \(y > r_J\).) Provided the definition of w determines the
proper interval, the maximum likelihood estimators of \(\beta\) and \(\sigma^2\) will be the same.

When w is defined to have the same units as y, it is tempting to ignore the grouping
of the data and just to run an OLS regression of \(w_i\) on \(x_i\), \(i = 1, \ldots, N\). Naturally,
such a procedure is generally inconsistent for \(\beta\). Nevertheless, the results of Chung
and Goldberger apply: if \(\mathrm{E}(x \mid y)\) is linear in y, a linear regression can estimate the
slope coefficients up to a common scale factor.

Sometimes the interval limits change across i, a possibility that causes no problems 
if we assume the limits are exogenous in the following sense: 


\[ D(y_i \mid x_i, r_{i1}, \ldots, r_{iJ}) = D(y_i \mid x_i). \tag{19.12} \]


In the binary censoring example from the previous section, assumption (19.12) holds
because the one limit value (J = 1) is randomly assigned. Generally, the limits can be
a function of \(x_i\) (because these are being conditioned on). The resulting log likelihood
is exactly as in equation (19.11) with \(r_j\) replaced with \(r_{ij}\). Some econometrics packages
have “interval regression” commands that allow one to specify the lower and upper 
endpoints for each unit i, allowing for unit-specific endpoints. 

Because of the underlying normality assumption, we can use the Rivers and Vuong 
(1988) and Smith and Blundell (1986) control function approach to test and correct 
for endogeneity of explanatory variables. The underlying model is the standard linear 
model 


\[ y_1 = z_1\delta_1 + \alpha_1 y_2 + u_1 \tag{19.13} \]


and we observe the censored variable, \(w_1\). Given the linear reduced form
\(y_2 = z\delta_2 + v_2\), we proceed as before: just add the first-stage residuals, \(\hat{v}_2\), to the interval
regression model, along with \((z_1, y_2)\). Of course, we are interested in \(\alpha_1\) and \(\delta_1\), along
with the coefficient on \(\hat{v}_2\) to determine whether \(y_2\) is in fact endogenous.
Unfortunately, such an approach only works when \(y_2\) is not censored. It is very difficult to
account for interval censoring of \(y_2\) along with that for \(y_1\).

Modifying interval regression for linear, unobserved-effects panel data models is 
straightforward, provided we are willing to rely on the Chamberlain-Mundlak device. 
We would write 


\[ y_{it} = x_{it}\beta + c_i + u_{it}, \quad t = 1, \ldots, T \tag{19.14} \]
\[ c_i = \psi + \bar{x}_i\xi + a_i, \tag{19.15} \]


where all unobservables have normal distributions and the interval limits,
\(\{r_j: j = 1, 2, \ldots, J\}\), can vary by i and t. Estimation can be carried out under the assumption
of serial independence in \(\{u_{it}: t = 1, \ldots, T\}\), so that the log likelihood has a random
effects structure, or without imposing any assumption on the serial dependence 
(which leads to pooled estimation of the type we covered in Chapters 15 and 16). 


19.2.3 Censoring from Above and Below 


Two kinds of censoring are common: with right censoring (censoring from above), the
value of a variable is observed only when it is below a known cap; with left censoring
(censoring from below), it is observed only when it is above a known floor. In some
cases both occur. Consider first the case of right censoring. For each i, a censoring
threshold, \(r_i\), is observed. When we randomly draw a
unit from the population, we observe the explanatory variables, \(x_i\). However, rather
than observing the outcome on \(y_i\), we effectively observe


\[ w_i = \min(y_i, r_i). \tag{19.16} \]


If the underlying population distribution for y is continuous, the probability that
\(y_i = r_i\) is zero, and so, if \(w_i = r_i\), we know that the observation is censored; if \(w_i < r_i\),
we know that \(w_i = y_i\); that is, we observe \(y_i\). If \(y_i\) is (partially) discrete, there can be
positive probability that \(y_i = r_i\). As a practical matter, when the censoring points
change across i, it is helpful in data sets to define a binary variable indicating whether 
an observation is censored. 

For concreteness, we focus on the case where the population distribution of \(y_i\) is
continuous. Let \(f(y \mid x; \theta)\) denote the conditional density. Under
\(D(y_i \mid x_i, r_i) = D(y_i \mid x_i)\), we can easily obtain the density of \(w_i\) conditional on \((x_i, r_i)\) because, for
\(w < r_i\),


\[ P(w_i \le w \mid x_i, r_i) = P(y_i \le w \mid x_i) = F(w \mid x_i; \theta), \]


where \(F(\cdot \mid x_i; \theta)\) is the cdf of \(y_i\) conditional on \(x_i\). Therefore, the probability density
of \(w_i\) given \((x_i, r_i)\) is simply \(f(w \mid x_i; \theta)\) for \(w < r_i\), that is, for values strictly less than
the censoring point. Further,


\[ P(w_i = r_i \mid x_i, r_i) = P(y_i \ge r_i \mid x_i, r_i) = 1 - F(r_i \mid x_i; \theta). \]

It follows that we can write the probability density of \(w_i\) given \((x_i, r_i)\) as

\[ g(w \mid x_i, r_i; \theta) = [f(w \mid x_i; \theta)]^{1[w < r_i]} [1 - F(r_i \mid x_i; \theta)]^{1[w = r_i]}. \tag{19.17} \]


The log-likelihood function for a random draw i (where we do not bother to
distinguish between the “true” value of \(\theta\) and a generic value) is


\[ \log[g(w_i \mid x_i, r_i; \theta)] = 1[w_i < r_i] \log[f(w_i \mid x_i; \theta)] + 1[w_i = r_i] \log[1 - F(r_i \mid x_i; \theta)], \tag{19.18} \]

and we sum this expression across all i to obtain the log likelihood for the entire
sample. Some authors prefer to write the log likelihood separately for the uncensored
and censored observations, but equation (19.18) is actually preferred because it shows
that the units are being randomly sampled from a population. Here, we are drawing
\((w_i, x_i, r_i)\). In the vast majority of cases, the conditions sufficient for MLE to be well
behaved covered in Chapter 13 hold for censored estimation because the model
\(f(y \mid x; \theta)\) is smooth in \(\theta\).

An interesting feature of equation (19.18) is that we only need to observe the
censoring point, \(r_i\), for censored observations. (However, we do need to know which
observations are censored and which are not.) This feature of the MLE approach is 
useful because sometimes in applications to duration models—see Chapter 22—the 
censoring value is reported only for observations that are actually censored. 

In the leading case, y follows a classical linear model in the population of interest, 
that is, 


\[ D(y \mid x) = \mathrm{Normal}(x\beta, \sigma^2), \tag{19.19} \]


in which case we have what is typically called the censored normal regression model. 
(In the econometrics literature, this is sometimes called the censored Tobit model or 
Type I Tobit model, but we are reserving those names for the case of corner solution 
responses; see Chapter 17.) The log likelihood for the censored normal regression 
model is 


\[ \ell_i(\theta) = 1[w_i < r_i] \log\{\sigma^{-1}\phi[(w_i - x_i\beta)/\sigma]\} + 1[w_i = r_i] \log\{1 - \Phi[(w_i - x_i\beta)/\sigma]\}. \tag{19.20} \]


Many standard econometrics packages estimate this model with little computational 
difficulty. Often, a transformation, such as taking the natural log, is needed to make 
equation (19.19) a reasonable assumption. In cases where no such transformation is 
available that ensures normality, the more general formulation in equation (19.18) 
can be used to obtain the log likelihood. 
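
As an illustration of equation (19.20), the sketch below estimates a censored normal regression by maximum likelihood on simulated right-censored data and compares it with OLS that ignores the censoring. Everything specific here is an assumption made for the example: the population values (1, 2, and \(\sigma = 1.5\)), the uniform distribution of the censoring points, and the function name `censored_normal_negloglik`. Note that the function takes \(w_i\) and a censoring indicator, so the censoring point is needed only for censored observations, where it equals \(w_i\).

```python
import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm


def censored_normal_negloglik(params, X, w, censored):
    """Negative of the sum of (19.20); params holds beta followed by log(sigma)."""
    beta, sigma = params[:-1], np.exp(params[-1])
    z = (w - X @ beta) / sigma
    loglik = np.where(
        censored,
        norm.logsf(z),                      # log{1 - Phi[(w_i - x_i beta)/sigma]}
        norm.logpdf(z) - np.log(sigma),     # log{sigma^{-1} phi[(w_i - x_i beta)/sigma]}
    )
    return -np.sum(loglik)


rng = np.random.default_rng(3)
N = 3000
x1 = rng.normal(size=N)
X = np.column_stack([np.ones(N), x1])
y = 1.0 + 2.0 * x1 + 1.5 * rng.normal(size=N)   # latent outcome (assumed values)
r = rng.uniform(1.0, 5.0, size=N)               # unit-specific censoring points
censored = y >= r
w = np.where(censored, r, y)                    # w_i = min(y_i, r_i), equation (19.16)

# OLS of w on x ignores the censoring; use it only for starting values.
b_ols = np.linalg.lstsq(X, w, rcond=None)[0]
start = np.append(b_ols, np.log(np.std(w - X @ b_ols)))

res = minimize(censored_normal_negloglik, start, args=(X, w, censored), method="BFGS")
beta_hat, sigma_hat = res.x[:-1], np.exp(res.x[-1])
print("OLS slope ignoring censoring:", round(b_ols[1], 2))
print("MLE slope:", round(beta_hat[1], 2), "sigma_hat:", round(sigma_hat, 2))
```

In this simulated design the OLS slope is noticeably smaller than the MLE slope, which is close to the population value used to generate the data.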

Distinguishing between the underlying population model and the censoring scheme 
can lead to some perhaps surprising implications for econometric practice. For ex- 
ample, suppose that survey data are collected on charitable contributions, where 
contributions are censored at a fixed cap—let us say $10,000 a year for concreteness. 
In the population, there is no natural upper bound for charitable contributions, so 
one knows immediately that the observed pileup at $10,000 in the survey data is due 
to built-in data censoring. It is proper to say that charitable contributions are
“censored from above at $10,000.” But the pileup at zero is a different matter: it is due to
the fact that some fraction of the population will have zero charitable contributions
in a particular year. Therefore, an appropriate course of action is to treat charitable
contributions in the population as a corner solution response, with a corner at zero. 
(We might use a Tobit model, or one of the two-part models discussed in Section 
17.6.) Unlike the censoring from above at $10,000, it makes no sense to say charita- 
ble contributions are also “censored from below at zero.” There is a corner at zero, 
but it is not due to data censoring. 

Interestingly, in the situation just described, if we assume a Type I Tobit model in 
the population for charitable contributions, the log-likelihood function for the right- 
censored charitable contributions is identical to that for the two-limit Tobit model 
discussed in Section 17.7. Estimation can be done rather straightforwardly by speci- 
fying zero and 10,000 as the lower and upper bounds, respectively. The practically 
important issue concerns what we do with the estimates. In fact, after obtaining the 
estimates, all calculations—of response probabilities, expected values, and so on— 
should be based on the Type I Tobit model (ignoring the right censoring). We are 
interested in features of D(y |x) in the population, where y is actual charitable con- 
tributions. Therefore, after accounting for the top coding in estimation by using a 
two-limit Tobit, we revert to the Type I Tobit for all statements about charitable 
contributions. The parameters themselves are of interest insofar as they allow us to 
compute partial effects on probabilities, means, and medians. But the corner at 
$10,000 plays no role in such calculations. Similar comments hold for two-part 
models for corner solutions. 
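
To make the last point concrete, the following sketch codes the standard Type I Tobit expressions for the response probability, the conditional mean, and the partial effect of a continuous covariate on the mean, with a corner at zero. The function names are hypothetical, and the inputs `xb` and `sigma` stand for \(x\hat{\beta}\) and \(\hat{\sigma}\) obtained from whatever estimation accounts for the top coding. The upper censoring value deliberately does not appear anywhere in these functions.

```python
import numpy as np
from scipy.stats import norm


def tobit_prob_positive(xb, sigma):
    """P(y > 0 | x) for y = max(0, x beta + u), u | x ~ Normal(0, sigma^2)."""
    return norm.cdf(xb / sigma)


def tobit_mean(xb, sigma):
    """E(y | x) = Phi(x beta / sigma) * x beta + sigma * phi(x beta / sigma)."""
    z = xb / sigma
    return norm.cdf(z) * xb + sigma * norm.pdf(z)


def tobit_mean_partial_effect(xb, sigma, beta_j):
    """Partial effect of a continuous x_j on E(y | x): Phi(x beta / sigma) * beta_j."""
    return norm.cdf(xb / sigma) * beta_j


# Illustrative check by simulation, using assumed values x beta = 1 and sigma = 2.
rng = np.random.default_rng(4)
draws = np.maximum(0.0, 1.0 + 2.0 * rng.normal(size=400_000))
print("formula mean:", tobit_mean(1.0, 2.0))
print("simulated mean:", draws.mean())
print("formula P(y > 0):", tobit_prob_positive(1.0, 2.0))
print("simulated P(y > 0):", np.mean(draws > 0.0))
```

The simulation draws from the uncensored Type I Tobit population, so agreement between the formulas and the simulated moments is a check on the population quantities of interest, not on anything involving the top-coded data.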

A more subtle example occurs when, say, an upper limit is imposed by law. Con- 
sider a stylized case where individuals may contribute no more than 15% of their in- 
come to retirement plans. In the population of working people, some individuals will 
contribute zero, some will contribute at the 15 percent upper limit, and many will 
contribute a percentage strictly between zero and 15. If we are interested in the effect 
of explanatory variables—say, taking courses on retirement savings—on the contri- 
bution under the current legal regime, a two-limit Tobit model makes sense for the 
contribution percentage. If we want to know, say, the mean or median difference in 
the rate between those who have and have not had a retirement savings course, we 
would use the formulas for the two-limit Tobit model. However, one might want 
to know the effects of covariates on the contribution percentage in the absence of 
institutional constraints. Then we would be back to the previous situation: the corner 
at zero is a corner that arises from utility maximization, but the corner at 15 is 
imposed—in this case, by law rather than the data collection scheme. In this case, 
one would use formulas for a Type I Tobit with corner at zero, even though the 
estimates necessarily are obtained from a two-limit Tobit. In this application, an 
additional calculation may be of interest: what would be the effect on the average
contribution rate if the limit were increased? We can use the formulas for a two-limit
Tobit in this case to compute the appropriate derivative. 

Not surprisingly, endogenous explanatory variables with censored data can be 
handled using methods very similar to those for the Type I Tobit model. For exam- 
ple, suppose the population model is 


\[ y_1 = z_1\delta_1 + \alpha_1 y_2 + u_1, \tag{19.21} \]


where \(D(u_1 \mid z)\) is \(\mathrm{Normal}(0, \sigma_1^2)\). However, the data are right censored. If the reduced
form for \(y_2\) is \(y_2 = z\delta_2 + v_2\), where \((u_1, v_2)\) is independent of z and bivariate normal,
we can apply the Smith and Blundell (1986) approach to account for the right
censoring. This assumes that \(y_2\) is not censored, so that the first step is OLS of \(y_2\) on z
using a random sample. The residuals, \(\hat{v}_2\), are added to the censored normal
regression in the second stage. Of course, because the underlying population model is
linear, we are interested in \(\alpha_1\) and \(\delta_1\). Joint MLE is possible, too, and would be more
efficient and avoid the problem of inference after two-step estimation. (Of course, the 
bootstrap can be applied here by randomly drawing units from the sample. Remem- 
ber, we have a random sample of units. We simply include the first-step estimation 
and censored normal estimation within each bootstrap iteration.) 

With enough normality, it should not be surprising that the Chamberlain-Mundlak 
device can be used in the context of right and left censoring. Again, for simplicity 
consider the linear model \(y_{it} = x_{it}\beta + c_i + u_{it}\), where right censoring is at \(r_{it}\), which
can change across i and t. Natural exogeneity assumptions, along with convenient
distributional assumptions, are 


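\[ D(u_{it} \mid \mathbf{x}_i, r_{i1}, \ldots, r_{iT}, c_i) = D(u_{it}) = \mathrm{Normal}(0, \sigma_u^2), \quad t = 1, \ldots, T \tag{19.22} \]
\[ D(c_i \mid \mathbf{x}_i, r_{i1}, \ldots, r_{iT}) = D(c_i \mid \mathbf{x}_i) = \mathrm{Normal}(\psi + \bar{\mathbf{x}}_i\boldsymbol{\xi}, \sigma_a^2). \tag{19.23} \]

As usual, these assumptions mean we can write 


\[ y_{it} = \psi + \mathbf{x}_{it}\boldsymbol{\beta} + \bar{\mathbf{x}}_i\boldsymbol{\xi} + v_{it}, \qquad D(v_{it} \mid \mathbf{x}_i, \mathbf{r}_i) = \mathrm{Normal}(0, \sigma_a^2 + \sigma_u^2), \tag{19.24} \]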


where \(\mathbf{r}_i = (r_{i1}, \ldots, r_{iT})\) is the vector of censoring values for unit \(i\). Now, we can just 
apply pooled censored normal regression, with censoring points \(r_{it}\), and consistently 
estimate \(\psi\), \(\boldsymbol{\beta}\), \(\boldsymbol{\xi}\), and \(\sigma_v^2 = \sigma_a^2 + \sigma_u^2\). Because this is a partial likelihood method, we 
need to make the inference robust to serial correlation. Generally, we cannot separately 
identify \(\sigma_a^2\) and \(\sigma_u^2\) unless we make the further assumption that \(\{u_{it}: t = 1, \ldots, T\}\) 
is serially independent, in which case we can use a correlated random effects 
(CRE) likelihood approach similar in structure to the CRE Tobit model. Naturally, 
with the underlying population model linear, we are mainly interested in \(\boldsymbol{\beta}\) and 
appropriate inference concerning \(\boldsymbol{\beta}\). 

All the methods just discussed require a full distributional assumption. (In the 
panel data case, this statement applies to \(D(y_{it} \mid \mathbf{x}_i, c_i)\), not to a joint distribution 
across t.) Certainly it is of interest to relax such assumptions when possible. (In Sec- 
tion 19.8, we will discuss ways of using probability weights to relax distributional 
assumptions.) Powell’s (1984) censored least absolute deviations (CLAD) estimator 
can be applied to censored data without putting strong restrictions on D(y |x). To 
see how, begin with a linear model for the conditional median, 


\[ \mathrm{Med}(y \mid \mathbf{x}) = \mathbf{x}\boldsymbol{\beta}. \tag{19.25} \]


(Of course, this may or may not be the conditional mean. Powell’s approach applies 
to the conditional median.) Again, assume that our random sample consists of 
\((\mathbf{x}_i, r_i, w_i)\), where \(w_i = \min(y_i, r_i)\). LAD can be applied to the censoring case because 
the median passes through the min function: 


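\[ \mathrm{Med}(w_i \mid \mathbf{x}_i, r_i) = \min[\mathrm{Med}(y_i \mid \mathbf{x}_i, r_i), r_i] = \min[\mathrm{Med}(y_i \mid \mathbf{x}_i), r_i] = \min(\mathbf{x}_i\boldsymbol{\beta}, r_i), \tag{19.26} \]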


where the second equality holds under the assumption that \(\mathrm{Med}(y_i \mid \mathbf{x}_i, r_i) = 
\mathrm{Med}(y_i \mid \mathbf{x}_i)\), which means censoring is exogenous with respect to \(y_i\) in the conditional 
median sense. As before, the censoring values can be related to \(\mathbf{x}_i\). Given 
equation (19.26), we can use LAD to estimate \(\boldsymbol{\beta}\), resulting in the CLAD estimator: 


\[ \min_{\mathbf{b}} \sum_{i=1}^{N} |w_i - \min(\mathbf{x}_i\mathbf{b}, r_i)|. \tag{19.27} \]


As discussed in Chapter 17, Powell (1984) shows that CLAD is consistent and \(\sqrt{N}\)-asymptotically 
normal. His paper contains formulas for estimating the asymptotic 
variance of the CLAD estimator. 
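
Objective (19.27) can be evaluated directly. The following is a minimal Python sketch, not Powell's own algorithm: the function names, the simulated top-coded design, and the derivative-free Nelder-Mead search started from OLS are all assumptions made for illustration. Because the objective is neither smooth nor convex, a practical implementation would likely use a more careful optimizer and several starting values.

```python
import numpy as np
from scipy.optimize import minimize


def clad_objective(b, w, X, r):
    """Sum of |w_i - min(x_i b, r_i)| over the sample, as in (19.27)."""
    return np.sum(np.abs(w - np.minimum(X @ b, r)))


def simulate_top_coded(n=2000, seed=0):
    """Hypothetical design: linear conditional median, top coding at r = 3."""
    rng = np.random.default_rng(seed)
    x = rng.normal(size=n)
    X = np.column_stack([np.ones(n), x])
    beta = np.array([1.0, 2.0])
    # Heteroskedastic, heavy-tailed error whose conditional median is zero.
    u = (1.0 + 0.5 * np.abs(x)) * rng.standard_t(df=3, size=n)
    y = X @ beta + u
    r = np.full(n, 3.0)       # censoring value, known for every unit
    w = np.minimum(y, r)      # observed (right-censored) response
    return w, X, r, beta


def clad_fit(w, X, r):
    """Minimize the CLAD objective, starting from OLS of w on X."""
    start = np.linalg.lstsq(X, w, rcond=None)[0]
    res = minimize(
        clad_objective, start, args=(w, X, r), method="Nelder-Mead",
        options={"xatol": 1e-6, "fatol": 1e-6, "maxiter": 5000},
    )
    return res.x, start
```

Calling `clad_fit(*simulate_top_coded()[:3])` returns the CLAD estimate together with the OLS starting value, so the two can be compared with the `beta` used in the simulation. Note that the function needs `r` for every unit, which is the data requirement discussed next.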

One subtle point concerning the CLAD estimator when applied to censored data is 
that it requires that the censoring value, \(r_i\), be available even when the observation is 
not right censored. This data requirement is not much of an issue in top-coding cases, 
especially when the same value is used. (For example, if wealth is top coded at 
$500,000, that information is known, and \(r_i = 500{,}000\) for all \(i\).) However, in some 
duration problems, which we treat explicitly in Chapter 22, only \(w_i\) is observed. 
That is, along with a censoring indicator, we observe either \(y_i\) or \(r_i\), but not both. 
Recall that the maximum likelihood estimator can be applied in situations where \(r_i\) is 
not always observed: we only need to observe \(r_i\) when the observation is actually 
censored. 


19.3 Overview of Sample Selection 


We now turn to estimation when a sample from a subset of the population is used to 
estimate the unknown parameters. The term selected sample is generally used to 
describe a sample that is not randomly drawn from the underlying population. As 
mentioned in Section 19.1, there are a variety of selection mechanisms that result in 
selected samples. Some mechanisms are due to sample design, while others are due to 
the behavior of the units being sampled, including nonresponse on survey questions 
and attrition from social programs. 

Before we launch into specifics, there is an important general point to remember: 
sample selection can only be an issue once the population of interest has been care- 
fully specified. If we propose a model for a subset of a larger population, it is proper 
to proceed by obtaining a random sample from that subpopulation and then using 
the standard econometric methods that we have covered thus far. That we do not 
have a random sample from the larger population does not affect our ability to con- 
sistently estimate the parameters of the model for the subpopulation. 

As a specific example that often leads to confusion, consider the conditional log- 
normal hurdle model that we discussed in Section 17.6.2. In that model, the distri- 
bution of log(y), conditional on y > 0 and the covariates x, is normally distributed 
with a linear conditional mean and constant variance. Therefore, the parameters in 
the model describing the \(y > 0\) subpopulation, \(\boldsymbol{\beta}\) and \(\sigma^2\), are consistently estimated 
using MLE (which is OLS in this case) on the subsample with \(y_i > 0\). Using a linear 
model for \(\log(y)\) on the subsample \(y_i > 0\) does not work in the Exponential Type II 
Tobit model because that model implies that \(\log(y)\) conditional on \(y > 0\) is not lognormal, 
nor does it have a linear conditional mean. Some authors prefer to view the 
failure of the standard \(\log(y)\) regression in the ET2T model as a sample selection 
problem, but that labeling misses the key point: in hurdle models, we are presumably 
interested in features of \(D(y \mid \mathbf{x})\), and the only issue is whether we have those features 
correctly specified. In the ET2T model, one often sees discussion of “sample selection 
bias” in estimating the vector \(\boldsymbol{\beta}\) in the formulation \(y = 1[\mathbf{x}\boldsymbol{\gamma} + v > 0]\exp(\mathbf{x}\boldsymbol{\beta} + u)\). 
But \(\boldsymbol{\beta}\) does not by itself provide partial effects of any conditional mean involving \(y\), 
and so, as we discussed in Section 17.6.3, focusing on estimates of \(\boldsymbol{\beta}\) in the ET2T 
model is inappropriate. (In the lognormal hurdle model, \(\boldsymbol{\beta}\) indexes the semielasticities 
and elasticities of \(E(y \mid \mathbf{x}, y > 0)\), which makes the \(\beta_j\) of direct interest.) By contrast, 
in a sample selection context with a linear regression for the underlying population, 
the focus on the single set of regression parameters is entirely appropriate, as we will 
see in the next several sections. 

As a second example, suppose y is a fractional response that takes on the values 
zero and one with positive probability, and takes on a range of values strictly be- 
tween zero and one. One possibility is to model P(y = 0|x) and P(y = 1|x) along 
with \(E(y \mid \mathbf{x}, 0 < y < 1)\). Suppose the latter is \(E(y \mid \mathbf{x}, 0 < y < 1) = \Phi(\mathbf{x}\boldsymbol{\beta})\). Then, to 
consistently estimate \(\boldsymbol{\beta}\) we can apply any of the consistent estimators for fractional 
responses in Section 18.6.1 to the subsample with 0 < y; < 1. We do not introduce a 
sample selection problem by ignoring the data with outcomes y; = 0 or y; = 1. 

Now that we know some contexts where sample selection is not an issue, we pro- 
vide some examples where nonrandom sampling is relevant and can (but does not 
always) cause serious problems. 


Example 19.1 (Saving Function): Suppose we wish to estimate a saving function for 
all families in a given country, and the population saving function is 


\[ \mathit{saving} = \beta_0 + \beta_1 \mathit{income} + \beta_2 \mathit{age} + \beta_3 \mathit{married} + \beta_4 \mathit{kids} + u, \tag{19.28} \]


where age is the age of the household head and the other variables are self-explanatory. 
However, we only have access to a survey that included families whose household 
head was 45 years of age or older. This restricted sampling raises a sample selection 
issue because we are interested in the saving function for all families, but we can 
obtain a random sample only for a subset of the population. 


Example 19.2 (Truncation Based on Wealth): We are interested in estimating the 
effect of worker eligibility in a particular pension plan (for example, a 401(k) plan) on 
family wealth. Let the population model be 


\[ \mathit{wealth} = \beta_0 + \beta_1 \mathit{plan} + \beta_2 \mathit{educ} + \beta_3 \mathit{age} + \beta_4 \mathit{income} + u, \tag{19.29} \]


where plan is a binary indicator for eligibility in the pension plan. However, we can 
only sample people with a net wealth less than $200,000, so the sample is selected on 
the basis of wealth. As we will see, sampling based on a response variable is much 
more serious than sampling based on an exogenous explanatory variable. 


In these two examples data were missing on all variables for a subset of the popu- 
lation as a result of survey design. In other cases, units are randomly drawn from the 
population, but data are missing on one or more variables for some units in the 
sample. Using a subset of a random sample because of missing data can lead to a 
sample selection problem. As we will see, if the reason the observations are missing is 
appropriately exogenous, using the subsample has no serious consequences. 
Our final example illustrates a more subtle form of a missing data problem. 


Example 19.3 (Wage Offer Function): Consider estimating a wage offer equation 
for people of working age. By definition, this equation is supposed to represent all 
people of working age, whether or not a person is actually working at the time of the 
survey. Because we can only observe the wage offer for working people, we effectively 
select our sample on this basis. 

This example is not as straightforward as the previous two. We treat it as a sample 
selection problem because data on a key variable, the wage offer \(\mathit{wage}^o\), are available 
only for a clearly defined subset of the population. This is sometimes called 
incidental truncation because \(\mathit{wage}^o\) is missing as a result of the outcome of another 
variable, labor force participation. 

The incidental truncation in this example has a strong self-selection component: 
people self-select into employment, so whether or not we observe \(\mathit{wage}^o\) depends on 
an individual’s labor supply decision. Whether we call examples like this sample 
selection or self-selection is largely irrelevant. The important point is that we must 
account for the nonrandom nature of the sample we have for estimating the wage 
offer equation. 


In the next several sections we cover a variety of sample selection issues, including 
tests and corrections. 


19.4 When Can Sample Selection Be Ignored? 


In Section 19.3 we briefly discussed how there can be no sample selection problem if 
the model we have specified applies to a subsample of a population for which we can 
obtain a random sample. Here, we discuss a more substantive question: under what 
circumstances does using a nonrandom sample from a specified population never- 
theless consistently estimate the population parameters? 


19.4.1 Linear Models: Estimation by OLS and 2SLS 


We begin by obtaining conditions under which estimation of the population model 
by two-stage least squares (2SLS) using a selected sample is consistent for the popu- 
lation parameters. These results are of interest in their own right, but we will also 
apply them to several situations later in the chapter. 

We assume there is a population represented by the random vector \((\mathbf{x}, y, \mathbf{z})\), where 
\(\mathbf{x}\) is a \(1 \times K\) vector of explanatory variables, \(y\) is the scalar response variable, and \(\mathbf{z}\) 
is a \(1 \times L\) vector of instrumental variables. The population model is the standard 
single-equation linear model with possibly endogenous explanatory variables: 


\[ y = \beta_1 + \beta_2 x_2 + \cdots + \beta_K x_K + u = \mathbf{x}\boldsymbol{\beta} + u \tag{19.30} \]
\[ E(\mathbf{z}'u) = \mathbf{0}, \tag{19.31} \]


where we take \(x_1 = 1\) for notational simplicity (an assumption that means \(z_1\) is 
almost certainly equal to unity, too). From Chapter 5 we know that, if we could 
obtain a random sample from the population, then equation (19.31), along with the 
rank condition (particularly \(\mathrm{rank}[E(\mathbf{z}'\mathbf{x})] = K\)), would be sufficient to consistently 
estimate \(\boldsymbol{\beta}\). As we will see, in the context of sample selection, equation (19.31) is 
rarely sufficient for consistency of the 2SLS estimator on the selected sample. 

A leading special case is z = x, in which case the explanatory variables are as- 
sumed to be uncorrelated with the error. Our general treatment allows elements of x 
to be correlated with u. 

Rather than obtaining a random sample from the population, we only use data 
points that satisfy certain conditions. The idea is to think of drawing units randomly 
from the population, but now a random draw for unit \(i\), \((\mathbf{x}_i, y_i, \mathbf{z}_i)\), is supplemented 
by drawing a selection indicator, \(s_i\). By definition, \(s_i = 1\) if unit \(i\) is used in the estimation, 
and \(s_i = 0\) if we do not use random draw \(i\). Therefore, our “data” consist of 
\(\{(\mathbf{x}_i, y_i, \mathbf{z}_i, s_i): i = 1, \ldots, N\}\), where the value of \(s_i\) determines whether we observe all 
of \((\mathbf{x}_i, y_i, \mathbf{z}_i)\). 

Because parameter identification should always be studied in a population, we let 
s denote a random variable with the distribution of s; for all i. In other words, 
\((\mathbf{x}, y, \mathbf{z}, s)\) now represents the population. Therefore, to determine the properties of 
any estimation procedure using the selected sample, we need to know about the dis- 
tribution of s and its dependence on (x, y, z). 

To obtain conditions under which 2SLS on the selected sample consistently estimates 
\(\boldsymbol{\beta}\), assume \(\{(\mathbf{x}_i, y_i, \mathbf{z}_i, s_i): i = 1, \ldots, N\}\) is a random sample from the population. 
The 2SLS estimator using the selected sample can be written as 


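\[
\begin{aligned}
\hat{\boldsymbol{\beta}} = {} & \left[ \left( N^{-1} \sum_{i=1}^{N} s_i \mathbf{z}_i'\mathbf{x}_i \right)' \left( N^{-1} \sum_{i=1}^{N} s_i \mathbf{z}_i'\mathbf{z}_i \right)^{-1} \left( N^{-1} \sum_{i=1}^{N} s_i \mathbf{z}_i'\mathbf{x}_i \right) \right]^{-1} \\
& \times \left( N^{-1} \sum_{i=1}^{N} s_i \mathbf{z}_i'\mathbf{x}_i \right)' \left( N^{-1} \sum_{i=1}^{N} s_i \mathbf{z}_i'\mathbf{z}_i \right)^{-1} \left( N^{-1} \sum_{i=1}^{N} s_i \mathbf{z}_i' y_i \right).
\end{aligned}
\]

Substituting \(y_i = \mathbf{x}_i\boldsymbol{\beta} + u_i\) gives 

\[
\begin{aligned}
\hat{\boldsymbol{\beta}} = \boldsymbol{\beta} + {} & \left[ \left( N^{-1} \sum_{i=1}^{N} s_i \mathbf{z}_i'\mathbf{x}_i \right)' \left( N^{-1} \sum_{i=1}^{N} s_i \mathbf{z}_i'\mathbf{z}_i \right)^{-1} \left( N^{-1} \sum_{i=1}^{N} s_i \mathbf{z}_i'\mathbf{x}_i \right) \right]^{-1} \\
& \times \left( N^{-1} \sum_{i=1}^{N} s_i \mathbf{z}_i'\mathbf{x}_i \right)' \left( N^{-1} \sum_{i=1}^{N} s_i \mathbf{z}_i'\mathbf{z}_i \right)^{-1} \left( N^{-1} \sum_{i=1}^{N} s_i \mathbf{z}_i' u_i \right).
\end{aligned}
\tag{19.32}
\]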


It is easily seen from equation (19.32) that the key condition for consistency is 
\(E(s_i\mathbf{z}_i'u_i) = \mathbf{0}\), along with the rank condition on the selected sample. Formally, we 
have the following result: 


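THEOREM 19.1 (Consistency of 2SLS under Sample Selection): In model (19.30), 
assume that \(E(u^2) < \infty\), \(E(x_j^2) < \infty\), \(j = 1, \ldots, K\), and \(E(z_j^2) < \infty\), \(j = 1, \ldots, L\). 
Further, assume that 


\[ E(s\mathbf{z}'u) = \mathbf{0} \tag{19.33} \]
\[ \mathrm{rank}\, E(\mathbf{z}'\mathbf{z} \mid s = 1) = L \tag{19.34} \]
\[ \mathrm{rank}\, E(\mathbf{z}'\mathbf{x} \mid s = 1) = K. \tag{19.35} \]


Then the 2SLS estimator using the selected sample is consistent for \(\boldsymbol{\beta}\) and \(\sqrt{N}\)-asymptotically 
normal. 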


Equation (19.32) essentially proves the consistency result under the assumptions of 
Theorem 19.1. Conditions (19.34) and (19.35) are fairly straightforward, and imply 
that the usual rank condition for 2SLS holds in the selected subpopulation. Natu- 
rally, it is possible that the rank condition holds on the entire population but not the 
s = 1 subset. For example, if s = 1 denotes females in the working population, then 
neither x nor z can include a female dummy variable. Generally, one needs sufficient 
variation in both the explanatory variables and instruments in the subpopulation in 
order for the rank condition to hold. 

It is worth studying condition (19.33) in some detail. First, it is not generally 
enough to assume condition (19.31), that is, zero correlation between the instruments 
and errors in the population. One case where condition (19.31) is sufficient occurs 
when \(s\) is independent of \((\mathbf{z}, u)\), so that \(E(s\mathbf{z}'u) = E(s)E(\mathbf{z}'u) = \mathbf{0}\) if \(E(\mathbf{z}'u) = \mathbf{0}\). Of 
course, independence of selection and \((\mathbf{z}, u)\) is a very strong assumption. In the leading 
case where \(\mathbf{z} = \mathbf{x}\), so that estimation is by OLS, the condition is the same as 
independence between \(s\) and \((\mathbf{x}, y)\), which is tantamount to assuming that some 
observations from an original random sample are dropped randomly, without regard 
to the values of \((\mathbf{x}_i, y_i)\). In the statistics literature, this has been called the missing 
completely at random (MCAR) assumption; see, for example, Little and Rubin 
(2002). 

More interesting are situations where selection can depend on exogenous variables, 
but not on the unobserved error. An important sufficient condition for assumption 
(19.33), easily verified by applying iterated expectations, is 


E(u|z,s) = 0. (19.36) 
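
The iterated-expectations step can be written out as a short sketch. Conditional on \((\mathbf{z}, s)\), both \(s\) and \(\mathbf{z}\) are fixed, so they factor out of the inner expectation, and assumption (19.36) sets what remains to zero:

```latex
\[
\begin{aligned}
E(s\mathbf{z}'u) &= E\left[ E(s\mathbf{z}'u \mid \mathbf{z}, s) \right] \\
                 &= E\left[ s\mathbf{z}'\, E(u \mid \mathbf{z}, s) \right] \\
                 &= E\left( s\mathbf{z}' \cdot 0 \right) = \mathbf{0},
\end{aligned}
\]
```

which is condition (19.33).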


Assumption (19.36) allows selection to be correlated with z but not with u, and has 
been called exogenous sampling. It is easier to interpret this label by looking at a 
special case, which strengthens the sense in which the instruments are exogenous in 
the population. Rather than equation (19.31), make the population zero-conditional- 
mean assumption 


E(u|z) = 0. (19.37) 


By basic properties of conditional expectations, if assumption (19.37) holds and s is a 
deterministic function of z, then assumption (19.36) holds (which means, of course, 
that assumption (19.33) holds). In other words, exogenous sampling occurs when 
\(s = h(\mathbf{z})\) for some nonrandom function \(h(\cdot)\); that is, \(s\) is a nonrandom function of 
exogenous variables. But we must remember that the sense in which the instruments 
are exogenous is given by the stronger assumption (19.37). 
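
A small simulation can illustrate exogenous sampling with an endogenous regressor. This is only a sketch under assumed values: the data-generating process, the selection rule \(s = 1[z_2 > -0.5]\), and the function names are hypothetical choices, not taken from the text. Selection is a deterministic function of the instrument, so 2SLS on the selected sample should stay close to \(\boldsymbol{\beta}\), while OLS on the same sample remains inconsistent because the regressor is endogenous.

```python
import numpy as np


def tsls(y, X, Z):
    """2SLS: [X'Z (Z'Z)^{-1} Z'X]^{-1} X'Z (Z'Z)^{-1} Z'y."""
    ZX, ZZ, Zy = Z.T @ X, Z.T @ Z, Z.T @ y
    A = ZX.T @ np.linalg.solve(ZZ, ZX)
    b = ZX.T @ np.linalg.solve(ZZ, Zy)
    return np.linalg.solve(A, b)


def simulate_exogenous_selection(n=200_000, seed=0):
    """Hypothetical design: x2 is endogenous, z2 is a valid instrument."""
    rng = np.random.default_rng(seed)
    z2 = rng.normal(size=n)
    v = rng.normal(size=n)
    e = rng.normal(size=n)
    x2 = 0.8 * z2 + v            # reduced form for the endogenous regressor
    u = 0.8 * v + e              # u is correlated with x2 through v
    beta = np.array([1.0, 0.5])
    X = np.column_stack([np.ones(n), x2])
    Z = np.column_stack([np.ones(n), z2])
    y = X @ beta + u
    s = z2 > -0.5                # selection: s = h(z), a function of z only
    return {
        "beta": beta,
        "tsls_selected": tsls(y[s], X[s], Z[s]),
        "ols_selected": np.linalg.lstsq(X[s], y[s], rcond=None)[0],
    }
```

Both estimators use only the rows with `s` true, which corresponds to multiplying each summand in equation (19.32) by \(s_i\).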

In the case with exogenous explanatory variables, condition (19.36) is equivalent to 


\[ E(y \mid \mathbf{x}, s) = E(y \mid \mathbf{x}) = \mathbf{x}\boldsymbol{\beta}, \tag{19.38} \]


where the first equality is how we would define exogenous sampling in the context of 
regression analysis. Notice that assumption (19.38) implies consistency of OLS on the 
selected sample regardless of whether data are missing on y or elements of x, or both. 
Assumption (19.38) rules out selection mechanisms that depend on the unobserved 
factors affecting y in equation (19.30). This assumption is related to (a conditional 
mean version of) the missing at random (MAR) assumption (for example, Little and 
Rubin, 2002). MAR, which is often stated in terms of conditional distributions, 
further assumes that the variables determining selection (x in this case) are always 
observed. We will have more to say on this kind of assumption in Section 19.8 on 
inverse probability weighting. 

A sufficient condition for assumption (19.36) is that, conditional on \(\mathbf{z}\), \(u\) and \(s\) are 
independent, which is informatively written as \(D(s \mid \mathbf{z}, u) = D(s \mid \mathbf{z})\), or, because \(s\) is 
binary, 

\[ P(s = 1 \mid \mathbf{z}, u) = P(s = 1 \mid \mathbf{z}). \tag{19.39} \]


Because z is observable—at least for part of the population—and u is always unob- 
servable, condition (19.39) is an example of what has been dubbed selection on 
observables, although this concept is usually employed in settings where z is always 
observed, something not required for the statement of Theorem 19.1. Again, we will 
have more to say on this kind of assumption in Section 19.8, where sample selection 
can also be based on variables that appear outside the model specification. 

The asymptotic normality also follows in a fairly straightforward manner from 
equation (19.32); remember, the summands are i.i.d. random vectors, and the 
last term has a zero mean by assumption (19.33). Not surprisingly, the usual 
heteroskedasticity-robust variance matrix estimator, applied to the selected sample, is 
valid without further assumptions. 

If we add the homoskedasticity assumption \(E(u^2 \mid \mathbf{z}, s) = E(u^2) = \sigma^2\) (which is the 
same as \(\mathrm{Var}(u \mid \mathbf{z}, s) = \sigma^2\) if we maintain assumption (19.36)) to the assumptions of 
Theorem 19.1, then we can show that the “usual” variance matrix estimator for 2SLS 
is asymptotically valid. Doing so requires two steps. First, under \(E(u^2 \mid \mathbf{z}, s) = \sigma^2\) the 
usual iterated expectations argument gives \(E(su^2\mathbf{z}'\mathbf{z}) = \sigma^2 E(s\mathbf{z}'\mathbf{z})\). This equation can 
be used to show that \(\mathrm{Avar}\,\sqrt{N}(\hat{\boldsymbol{\beta}} - \boldsymbol{\beta}) = \sigma^2\{E(s\mathbf{x}'\mathbf{z})[E(s\mathbf{z}'\mathbf{z})]^{-1}E(s\mathbf{z}'\mathbf{x})\}^{-1}\). The second 
step is to show that the usual 2SLS estimator of \(\sigma^2\) is consistent. This fact can 
be seen as follows. Under the homoskedasticity assumption, \(E(su^2) = E(s)\sigma^2\), where 
\(E(s)\) is just the fraction of the subpopulation in the overall population. The estimator 
of \(\sigma^2\) (without degrees-of-freedom adjustment) is 


\[ \left( \sum_{i=1}^{N} s_i \right)^{-1} \sum_{i=1}^{N} s_i \hat{u}_i^2, \tag{19.40} \]


since \(\sum_{i=1}^{N} s_i\) is simply the number of observations in the selected sample. Removing 
the “^” from \(\hat{u}_i^2\) and applying the law of large numbers gives \(N^{-1}\sum_{i=1}^{N} s_i \overset{p}{\to} E(s)\) and 
\(N^{-1}\sum_{i=1}^{N} s_i u_i^2 \overset{p}{\to} E(su^2) = E(s)\sigma^2\). Since the \(N^{-1}\) terms cancel, expression (19.40) 
converges in probability to \(\sigma^2\). 

If \(s\) is a function only of \(\mathbf{z}\), or \(s\) is independent of \((\mathbf{z}, u)\), and \(E(u^2 \mid \mathbf{z}) = \sigma^2\) 
(that is, if the homoskedasticity assumption holds in the original population), then 
\(E(u^2 \mid \mathbf{z}, s) = \sigma^2\). As mentioned previously, without the homoskedasticity assumption 
we would just use the heteroskedasticity-robust standard errors, just as if a random 
sample were available with heteroskedasticity present in the population model. 

When x is exogenous and we apply OLS on the selected sample, Theorem 19.1 
implies that we can select the sample on the basis of the explanatory variables. 
Selection based on \(y\) or on endogenous elements of \(\mathbf{x}\) is not allowed because then 
\(E(u \mid \mathbf{x}, s) \neq E(u)\). 
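
The contrast between selecting on \(\mathbf{x}\) and selecting on \(y\) is easy to see in a simulation. The sketch below is illustrative only: the linear model, the two selection rules, and the function names are assumptions chosen for the example. OLS on the subsample with \(x > 0\) should recover the population slope, while OLS on the subsample with \(y > 1\) should be attenuated toward zero.

```python
import numpy as np


def ols(X, y):
    """OLS coefficients via least squares."""
    return np.linalg.lstsq(X, y, rcond=None)[0]


def simulate_selection(n=200_000, seed=0):
    """Hypothetical population: y = 1 + 0.5 x + u with x exogenous."""
    rng = np.random.default_rng(seed)
    x = rng.normal(size=n)
    u = rng.normal(size=n)
    beta = np.array([1.0, 0.5])
    X = np.column_stack([np.ones(n), x])
    y = X @ beta + u
    s_x = x > 0.0     # selection is a deterministic function of x
    s_y = y > 1.0     # selection depends on y, and therefore on u
    return {
        "beta": beta,
        "full_sample": ols(X, y),
        "select_on_x": ols(X[s_x], y[s_x]),
        "select_on_y": ols(X[s_y], y[s_y]),
    }
```

Under the first rule \(E(u \mid x, s) = 0\) still holds; under the second, units with low \(x\) enter the sample only when \(u\) is large, which induces negative correlation between \(x\) and \(u\) in the selected sample.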


Example 19.4 (Nonrandomly Missing IQ Scores): As an example of how Theorem 
19.1 can be applied, consider the analysis in Griliches, Hall, and Hausman (1978) 
(GHH). The structural equation of interest is 


\[ \log(\mathit{wage}) = \mathbf{z}_1\boldsymbol{\delta}_1 + \mathit{abil} + v, \qquad E(v \mid \mathbf{z}_1, \mathit{abil}, \mathit{IQ}) = 0, \]


and we assume that \(\mathit{IQ}\) is a valid proxy for \(\mathit{abil}\) in the sense that \(\mathit{abil} = \theta_1 \mathit{IQ} + e\) and 
\(E(e \mid \mathbf{z}_1, \mathit{IQ}) = 0\) (see Section 4.3.2). Write 


\[ \log(\mathit{wage}) = \mathbf{z}_1\boldsymbol{\delta}_1 + \theta_1 \mathit{IQ} + u, \tag{19.41} \]


where \(u = v + e\). Under the assumptions made, \(E(u \mid \mathbf{z}_1, \mathit{IQ}) = 0\). It follows immediately 
from Theorem 19.1 that, if we choose the sample excluding all people with 
IQs below a fixed value, then OLS estimation of equation (19.41) will be consistent. 
This problem is not quite the one faced by GHH. Instead, GHH noticed that the 
probability of IQ missing was higher at lower IQs (because people were reluctant 
to give permission to obtain IQ scores). A simple way to model this situation is \(s = 1\) 
if \(\mathit{IQ} + r \geq 0\), \(s = 0\) if \(\mathit{IQ} + r < 0\), where \(r\) is an unobserved random variable. If \(r\) is 
redundant in the structural equation and in the proxy variable equation for \(\mathit{IQ}\), that 
is, if \(E(v \mid \mathbf{z}_1, \mathit{abil}, \mathit{IQ}, r) = 0\) and \(E(e \mid \mathbf{z}_1, \mathit{IQ}, r) = 0\), then \(E(u \mid \mathbf{z}_1, \mathit{IQ}, r) = 0\). Since \(s\) 
is a function of \(\mathit{IQ}\) and \(r\), it follows immediately that \(E(u \mid \mathbf{z}_1, \mathit{IQ}, s) = 0\). Therefore, 
using OLS on the sample for which \(\mathit{IQ}\) is observed yields consistent estimators. 

If \(r\) is correlated with either \(v\) or \(e\), \(E(u \mid \mathbf{z}_1, \mathit{IQ}, s) \neq E(u)\) in general, and OLS 
estimation of equation (19.41) using the selected sample would not consistently estimate 
\(\boldsymbol{\delta}_1\) and \(\theta_1\). Therefore, even though \(\mathit{IQ}\) is exogenous in the population equation 
(19.41), the sample selection is not exogenous. In Section 19.6.2 we cover a method 
that can be used to correct for sample selection bias. 


Theorem 19.1 has other useful applications. Suppose that x is exogenous in equa- 
tion (19.30) and that s is a nonrandom function of (x,v), where v is a variable not 
appearing in equation (19.30). If (u, v) is independent of x, then E(u|x,v) = E(u | v), 
and so 


\[ E(y \mid \mathbf{x}, v) = \mathbf{x}\boldsymbol{\beta} + E(u \mid \mathbf{x}, v) = \mathbf{x}\boldsymbol{\beta} + E(u \mid v). \]


If we make an assumption about the functional form of \(E(u \mid v)\), for example, 
\(E(u \mid v) = \gamma v\), then we can write 


\[ y = \mathbf{x}\boldsymbol{\beta} + \gamma v + e, \qquad E(e \mid \mathbf{x}, v) = 0, \tag{19.42} \]

where \(e = u - E(u \mid v)\). Because \(s\) is just a function of \((\mathbf{x}, v)\), \(E(e \mid \mathbf{x}, v, s) = 0\), and so \(\boldsymbol{\beta}\) 
and \(\gamma\) can be estimated consistently by the OLS regression \(y\) on \(\mathbf{x}\), \(v\), using the 
selected sample. Effectively, including \(v\) in the regression on the selected subsample 
eliminates the sample selection problem and allows us to consistently estimate \(\boldsymbol{\beta}\). 
(Incidentally, because \(v\) is independent of \(\mathbf{x}\), we would not have to include it in 
equation (19.30) to consistently estimate \(\boldsymbol{\beta}\) if we had a random sample from the 
population. However, including \(v\) would result in an asymptotically more efficient 
estimator of \(\boldsymbol{\beta}\) when \(\mathrm{Var}(y \mid \mathbf{x}, v)\) is homoskedastic. See problem 4.5.) In Section 19.7 
we will see how equation (19.42) can be implemented when v depends on unknown 
parameters. 
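
The role of \(v\) in equation (19.42) can be checked numerically when \(v\) is observed. The following sketch assumes a hypothetical design in which \(s = 1[x + v > 0]\) and \(E(u \mid v) = \gamma v\) with \(\gamma = 1\); the function names and parameter values are illustrative assumptions only. Omitting \(v\) on the selected sample should bias the slope on \(x\), while adding \(v\) as a regressor should restore consistency.

```python
import numpy as np


def ols(X, y):
    """OLS coefficients via least squares."""
    return np.linalg.lstsq(X, y, rcond=None)[0]


def simulate_selection_on_x_and_v(n=200_000, seed=0):
    """Hypothetical design for (19.42): s depends on (x, v), E(u | v) = gamma * v."""
    rng = np.random.default_rng(seed)
    x = rng.normal(size=n)
    v = rng.normal(size=n)          # independent of x and observed
    e = rng.normal(size=n)
    gamma = 1.0
    u = gamma * v + e               # so E(u | x, v) = gamma * v
    beta = np.array([1.0, 0.5])
    y = beta[0] + beta[1] * x + u
    s = (x + v) > 0.0               # nonrandom function of (x, v)
    ones = np.ones(int(s.sum()))
    return {
        "beta": beta,
        "gamma": gamma,
        "without_v": ols(np.column_stack([ones, x[s]]), y[s]),
        "with_v": ols(np.column_stack([ones, x[s], v[s]]), y[s]),
    }
```

In this design \(x\) and \(v\) are independent in the population but negatively correlated in the selected sample, which is why leaving \(v\) in the error term distorts the coefficient on \(x\).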


19.4.2 Nonlinear Models 


Results similar to those in the previous section hold for nonlinear models as well. We 
will cover explicitly the case of nonlinear regression and maximum likelihood. See 
problem 19.11 for the GMM case. 

In the nonlinear regression case, if E(y|x,s) = E(y|x)—so that selection is exog- 
enous in the conditional mean sense—then NLS on the selected sample is consistent. 
Sufficient is that s is a deterministic function of x. The consistency argument is sim- 
ple: NLS on the selected sample solves 


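\[ \min_{\boldsymbol{\beta}} N^{-1} \sum_{i=1}^{N} s_i [y_i - m(\mathbf{x}_i, \boldsymbol{\beta})]^2, \tag{19.43} \]

so it suffices to show that \(\boldsymbol{\beta}_o\) in \(E(y \mid \mathbf{x}) = m(\mathbf{x}, \boldsymbol{\beta}_o)\) minimizes \(E\{s[y - m(\mathbf{x}, \boldsymbol{\beta})]^2\}\) over 
\(\boldsymbol{\beta}\). By iterated expectations, 


\[ E\{s[y - m(\mathbf{x}, \boldsymbol{\beta})]^2\} = E(s\,E\{[y - m(\mathbf{x}, \boldsymbol{\beta})]^2 \mid \mathbf{x}, s\}). \]


Next, write \([y - m(\mathbf{x}, \boldsymbol{\beta})]^2 = u^2 + 2[m(\mathbf{x}, \boldsymbol{\beta}_o) - m(\mathbf{x}, \boldsymbol{\beta})]u + [m(\mathbf{x}, \boldsymbol{\beta}_o) - m(\mathbf{x}, \boldsymbol{\beta})]^2\), where 
\(u = y - m(\mathbf{x}, \boldsymbol{\beta}_o)\). By assumption, \(E(u \mid \mathbf{x}, s) = 0\). Therefore, 


\[ E\{[y - m(\mathbf{x}, \boldsymbol{\beta})]^2 \mid \mathbf{x}, s\} = E(u^2 \mid \mathbf{x}, s) + [m(\mathbf{x}, \boldsymbol{\beta}_o) - m(\mathbf{x}, \boldsymbol{\beta})]^2, \]


and the second term is clearly minimized at \(\boldsymbol{\beta} = \boldsymbol{\beta}_o\). We do have to assume that \(\boldsymbol{\beta}_o\) is 
the unique value of \(\boldsymbol{\beta}\) that makes \(E\{s[m(\mathbf{x}, \boldsymbol{\beta}) - m(\mathbf{x}, \boldsymbol{\beta}_o)]^2\}\) zero. This is the identification 
condition on the subpopulation. 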

It can also be shown that, if \(\mathrm{Var}(y \mid \mathbf{x}, s) = \mathrm{Var}(y \mid \mathbf{x})\) and \(\mathrm{Var}(y \mid \mathbf{x}) = \sigma_o^2\), then the 
usual, nonrobust NLS statistics are valid. If heteroskedasticity exists either in the 
population or the subpopulation, standard heteroskedasticity-robust inference can be 
used. The arguments are very similar to those for 2SLS in the previous subsection. 

Another important case is the general conditional maximum likelihood setup. 
Assume that the distribution of y given x and s is the same as the distribution of 
y given x: D(y|x,s) = D(y |x). This is a stronger form of ignorability of selection, 
but it always holds if s is a nonrandom function of x, or if s is independent of (x, y). 
In any case, D(y|x,s) = D(y|x) ensures that the MLE on the selected sample is 
consistent and that the usual MLE statistics are valid. The analogy argument should 
be familiar by now. Conditional MLE on the selected sample solves 


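\[ \max_{\boldsymbol{\theta}} N^{-1} \sum_{i=1}^{N} s_i \ell(y_i, \mathbf{x}_i; \boldsymbol{\theta}), \tag{19.44} \]

where \(\ell(y_i, \mathbf{x}_i; \boldsymbol{\theta})\) is the log likelihood for observation \(i\). Now for each \(\mathbf{x}\), \(\boldsymbol{\theta}_o\) maximizes 
\(E[\ell(y, \mathbf{x}; \boldsymbol{\theta}) \mid \mathbf{x}]\) over \(\boldsymbol{\theta}\). But \(E[s\ell(y, \mathbf{x}; \boldsymbol{\theta})] = E\{sE[\ell(y, \mathbf{x}; \boldsymbol{\theta}) \mid \mathbf{x}, s]\} = E\{sE[\ell(y, \mathbf{x}; \boldsymbol{\theta}) \mid \mathbf{x}]\}\), 
since, by assumption, the conditional distribution of \(y\) given \((\mathbf{x}, s)\) does not depend on 
\(s\). Since \(E[\ell(y, \mathbf{x}; \boldsymbol{\theta}) \mid \mathbf{x}]\) is maximized at \(\boldsymbol{\theta}_o\), so is \(E\{sE[\ell(y, \mathbf{x}; \boldsymbol{\theta}) \mid \mathbf{x}]\}\). We must make 
the stronger assumption that \(\boldsymbol{\theta}_o\) is the unique maximum, just as in the previous cases: 
if the selected subset of the population is too small, we may not be able to identify \(\boldsymbol{\theta}_o\). 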
Inference can be carried out using the usual MLE statistics obtained from the 
selected subsample because the information equality now holds conditional on x and 
s under the assumption that D(y|x,s) = D(y| x). We omit the details. 

Problem 19.11 asks you to work through the case of GMM estimation of general 
nonlinear models based on conditional moment restrictions. 


19.5 Selection on the Basis of the Response Variable: Truncated Regression 


Let \((\mathbf{x}_i, y_i)\) denote a random draw from a population. In this section we explicitly 
treat the case where the sample is selected on the basis of \(y_i\). 

In applying the following methods it is important to remember that there is an 
underlying population of interest, often described by a linear conditional expectation: 
\(E(y_i \mid \mathbf{x}_i) = \mathbf{x}_i\boldsymbol{\beta}\). If we could observe a random sample from the population, then we 
would just use standard regression analysis. The problem comes about because the 
sample we can observe is chosen at least partly based on the value of \(y_i\). Unlike in the 
case where selection is based only on \(\mathbf{x}_i\), selection based on \(y_i\) causes problems for 
standard OLS analysis on the selected sample. 

A classic example of selection based on y; is Hausman and Wise’s (1977) study of 
the determinants of earnings. Hausman and Wise recognized that their sample from a 
negative income tax experiment was truncated because only families with income 
below 1.5 times the poverty level were allowed to participate in the program; no data 
were available on families with incomes above the threshold value. The truncation 
rule was known, and so the effects of truncation could be accounted for. 

A similar example is Example 19.2. We do not observe data on families with wealth 
above $200,000. This case is different from the top coding example we discussed in 
Section 19.2.3. Here, we observe nothing about families with high wealth: they are 
entirely excluded from the sample. In the top coding case, we have a random sample 
of families, and we always observe \(\mathbf{x}_i\); the information on \(\mathbf{x}_i\) is useful even if wealth is 
top coded. 

We assume that \(y_i\) is a continuous random variable and that the selection rule 
takes the form 


\[ s_i = 1[a_1 < y_i < a_2], \]


where \(a_1\) and \(a_2\) are known constants such that \(a_1 < a_2\). A good way to think of the 
sample selection in this case is that we draw \((\mathbf{x}_i, y_i)\) randomly from the population. If 
\(y_i\) falls in the interval \((a_1, a_2)\), then we observe both \(y_i\) and \(\mathbf{x}_i\). If \(y_i\) is outside this 
interval, then we do not observe \(y_i\) or \(\mathbf{x}_i\). Thus, all we know is that there is some 
subset of the population that does not enter our data set because of the selection rule. 
We know how to characterize the part of the population not being sampled because 
we know the constants \(a_1\) and \(a_2\). 

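In most applications we are still interested in estimating \(E(y_i \mid \mathbf{x}_i) = \mathbf{x}_i\boldsymbol{\beta}\). However, 
because of sample selection based on \(y_i\), we must (at least in a parametric context) 
specify a full conditional distribution of \(y_i\) given \(\mathbf{x}_i\). Parameterize the conditional 
density of \(y_i\) given \(\mathbf{x}_i\) by \(f(\cdot \mid \mathbf{x}_i; \boldsymbol{\beta}, \boldsymbol{\gamma})\), where \(\boldsymbol{\beta}\) are the conditional mean parameters 
and \(\boldsymbol{\gamma}\) is a \(G \times 1\) vector of additional parameters. The cdf of \(y_i\) given \(\mathbf{x}_i\) is \(F(\cdot \mid \mathbf{x}_i; \boldsymbol{\beta}, \boldsymbol{\gamma})\). 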

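What we can use in estimation is the density of \(y_i\) conditional on \(\mathbf{x}_i\) and the 
fact that we observe \((y_i, \mathbf{x}_i)\). In other words, we must condition on \(a_1 < y_i < a_2\) or, 
equivalently, \(s_i = 1\). The cdf of \(y_i\) conditional on \((\mathbf{x}_i, s_i = 1)\) is simply 


\[ P(y_i \leq y \mid \mathbf{x}_i, s_i = 1) = \frac{P(y_i \leq y, s_i = 1 \mid \mathbf{x}_i)}{P(s_i = 1 \mid \mathbf{x}_i)}. \]


Because \(y_i\) is continuously distributed, \(P(s_i = 1 \mid \mathbf{x}_i) = P(a_1 < y_i < a_2 \mid \mathbf{x}_i) = F(a_2 \mid \mathbf{x}_i; \boldsymbol{\beta}, \boldsymbol{\gamma}) 
- F(a_1 \mid \mathbf{x}_i; \boldsymbol{\beta}, \boldsymbol{\gamma}) > 0\) for all possible values of \(\mathbf{x}_i\). The case \(a_2 = \infty\) corresponds to 
truncation only from below, in which case \(F(a_2 \mid \mathbf{x}_i; \boldsymbol{\beta}, \boldsymbol{\gamma}) = 1\). If \(a_1 = -\infty\) (truncation 
only from above), then \(F(a_1 \mid \mathbf{x}_i; \boldsymbol{\beta}, \boldsymbol{\gamma}) = 0\). To obtain the numerator when \(a_1 < y < a_2\), 
we have 


\[ P(y_i \leq y, s_i = 1 \mid \mathbf{x}_i) = P(a_1 < y_i \leq y \mid \mathbf{x}_i) = F(y \mid \mathbf{x}_i; \boldsymbol{\beta}, \boldsymbol{\gamma}) - F(a_1 \mid \mathbf{x}_i; \boldsymbol{\beta}, \boldsymbol{\gamma}). \]

When we put this equation over \(P(s_i = 1 \mid \mathbf{x}_i)\) and take the derivative with respect to 
the dummy argument \(y\), we obtain the density of \(y_i\) given \((\mathbf{x}_i, s_i = 1)\): 


\[ p(y \mid \mathbf{x}_i, s_i = 1) = \frac{f(y \mid \mathbf{x}_i; \boldsymbol{\beta}, \boldsymbol{\gamma})}{F(a_2 \mid \mathbf{x}_i; \boldsymbol{\beta}, \boldsymbol{\gamma}) - F(a_1 \mid \mathbf{x}_i; \boldsymbol{\beta}, \boldsymbol{\gamma})} \tag{19.45} \]


for \(a_1 < y < a_2\). 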

Given a model for \(f(y \mid \mathbf{x}; \boldsymbol{\beta}, \boldsymbol{\gamma})\), the log-likelihood function for any \((\mathbf{x}_i, y_i)\) in the 
sample can be obtained by plugging \(y_i\) into equation (19.45) and taking the log. The 
CMLEs of \(\boldsymbol{\beta}\) and \(\boldsymbol{\gamma}\) using the selected sample are efficient in the class of estimators 
that do not use information about the distribution of \(\mathbf{x}_i\). Standard errors and test 
statistics can be computed using the general theory of conditional MLE. 

In most applications of truncated samples, the population conditional distribution 
is assumed to be Normal\((\mathbf{x}\boldsymbol{\beta}, \sigma^2)\), in which case we have the truncated Tobit model or 
truncated normal regression model. The truncated Tobit model is related to the cen- 
sored Tobit model for data-censoring applications (see Section 19.2.3), but there is 
a key difference: in censored regression, we have a random sample of units and we 
observe the covariates x for all people, even those for whom the response is not 
known. If we drop observations entirely when the response is not observed, we obtain 
the truncated regression model. If in a top coding example we use the information 
in the top coded observations, we are in the censored regression case. If we drop all 
top coded observations, we are in the truncated regression case. (Given a choice, we 
should use a censored regression analysis, as it uses all of the information in the 
sample.) 

From our analysis of the censored regression model in Section 19.2.3, it is not
surprising that heteroskedasticity or nonnormality in truncated regression results
in inconsistent estimators of $\beta$. This outcome is unfortunate because, if not for the
sample selection problem, we could consistently estimate $\beta$ under $E(y \mid x) = x\beta$,
without specifying $\mathrm{Var}(y \mid x)$ or the conditional distribution. Distribution-free
methods for the truncated regression model have been suggested by Powell (1986) under
the assumption of a symmetric error distribution; see Powell (1994) for a recent
survey.

Truncating a sample on the basis of $y$ is related to choice-based sampling. Traditional
choice-based sampling applies when $y$ is a discrete response taking on a finite
number of values, where sampling frequencies differ depending on the outcome of $y$.
(In the truncation case, the sampling frequency is one when $y$ falls in the interval
$(a_1, a_2)$ and zero when $y$ falls outside of the interval.) We do not cover choice-based
sampling here; see Manski and McFadden (1981), Imbens (1992), and Cosslett
(1993). In Section 20.2 we cover some estimation methods for stratified sampling,
which can be applied to some choice-based samples.


19.6 Incidental Truncation: A Probit Selection Equation 


We now turn to sample selection corrections when selection is determined by a probit 
model. This setup applies to problems different from those in Section 19.5, where the 
problem was that a survey or program was designed to intentionally exclude part of 
the population. We are now interested in selection problems that are due to incidental 
truncation, attrition in the context of program evaluation, and general nonresponse
that leads to missing data on the response variable or the explanatory variables. 


19.6.1 Exogenous Explanatory Variables 


The incidental truncation problem is motivated by Gronau’s (1974) model of the 
wage offer and labor force participation. 


Example 19.5 (Labor Force Participation and the Wage Offer): Interest lies in estimating
$E(w_i^o \mid x_i)$, where $w_i^o$ is the hourly wage offer for a randomly drawn individual
$i$. If $w_i^o$ were observed for everyone in the (working age) population, we would proceed
in a standard regression framework. However, a potential sample selection
problem arises because $w_i^o$ is observed only for people who work.

We can cast this problem as a weekly labor supply model:

$$\max_h\ \mathrm{util}_i(w_i^o h + a_i, h) \quad \text{subject to } 0 \le h \le 168, \tag{19.46}$$

where $h$ is hours worked per week and $a_i$ is nonwage income of person $i$. Let $s_i(h) =
\mathrm{util}_i(w_i^o h + a_i, h)$, and assume that we can rule out the solution $h_i = 168$. Then the
solution can be $h_i = 0$ or $0 < h_i < 168$. If $ds_i/dh \le 0$ at $h = 0$, then the optimum is
$h_i = 0$. Using this condition, straightforward algebra shows that $h_i = 0$ if and only if

$$w_i^o \le -mu_i^h(a_i, 0)/mu_i^q(a_i, 0), \tag{19.47}$$

where $mu_i^h(\cdot,\cdot)$ is the marginal disutility of working and $mu_i^q(\cdot,\cdot)$ is the marginal
utility of income. Gronau (1974) called the right-hand side of equation (19.47) the
reservation wage, $w_i^r$, which is assumed to be strictly positive.
We now make the parametric assumptions

$$w_i^o = \exp(x_{i1}\beta_1 + u_{i1}), \qquad w_i^r = \exp(x_{i2}\beta_2 + \gamma_2 a_i + u_{i2}), \tag{19.48}$$

where $(u_{i1}, u_{i2})$ is independent of $(x_{i1}, x_{i2}, a_i)$. Here, $x_{i1}$ contains productivity
characteristics, and possibly demographic characteristics, of individual $i$, and $x_{i2}$ contains
variables that determine the marginal utility of leisure and income; these may overlap
with $x_{i1}$. From equation (19.48) we have the log wage equation

$$\log w_i^o = x_{i1}\beta_1 + u_{i1}. \tag{19.49}$$

But the wage offer $w_i^o$ is observed only if the person works, that is, only if $w_i^o > w_i^r$, or

$$\log w_i^o - \log w_i^r = x_{i1}\beta_1 - x_{i2}\beta_2 - \gamma_2 a_i + u_{i1} - u_{i2} \equiv x_i\delta_2 + v_{i2} > 0.$$

This behavior introduces a potential sample selection problem if we use data only on
working people to estimate equation (19.49).
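
To see the selection problem in numbers, here is a small simulation sketch of an intercept-only version of Gronau's setup. Everything in it is an illustrative assumption rather than part of the example: the covariates are dropped, both log wages have intercept 1, and the errors are independent standard normals.

```python
import random
from statistics import mean

def simulate_offers(n=20000, seed=1):
    """Draw log wage offers and reservation wages; return (log offers, work flags)."""
    rng = random.Random(seed)
    log_offers, works = [], []
    for _ in range(n):
        u1 = rng.gauss(0.0, 1.0)
        u2 = rng.gauss(0.0, 1.0)
        log_offer = 1.0 + u1          # intercept-only version of (19.49)
        log_reservation = 1.0 + u2    # log reservation wage
        log_offers.append(log_offer)
        works.append(log_offer > log_reservation)  # person works, so offer is observed
    return log_offers, works

log_offers, works = simulate_offers()
population_mean = mean(log_offers)
selected_mean = mean(w for w, s in zip(log_offers, works) if s)
print(f"population mean of log offer: {population_mean:.3f}")
print(f"mean among workers:           {selected_mean:.3f}")
```

Under these assumed parameters the mean log offer among workers should exceed the population mean by about 0.56, because people with high draws of $u_{i1}$ are more likely to work. An average over workers alone would therefore overstate the mean wage offer in the population.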


This example differs in an important respect from top coding examples. With top
coding, the censoring rule is known for each unit in the population. In Gronau's
example, we do not know $w_i^r$, so we cannot use $w_i^o$ in a censored regression analysis. If
$w_i^r$ were observed and exogenous and $x_{i1}$ were always observed, then we would be in
the censored regression framework with censoring from below. If $w_i^r$ were observed
and exogenous but $x_{i1}$ were observed only when $w_i^o$ is, we would be in the truncated
Tobit framework. But $w_i^r$ is allowed to depend on unobservables, and so we need a
new framework.

If we drop the $i$ subscript, let $y_1 = \log w^o$, and let $y_2$ be the binary labor force
participation indicator, Gronau's model can be written for a random draw from the
population as

$$y_1 = x_1\beta_1 + u_1 \tag{19.50}$$

$$y_2 = 1[x\delta_2 + v_2 > 0]. \tag{19.51}$$

We discuss estimation of this model under the following set of assumptions:

ASSUMPTION 19.1: (a) $(x, y_2)$ are always observed, $y_1$ is observed only when $y_2 = 1$;
(b) $(u_1, v_2)$ is independent of $x$ with zero mean; (c) $v_2 \sim \mathrm{Normal}(0, 1)$; and (d)
$E(u_1 \mid v_2) = \gamma_1 v_2$.


Assumption 19.1a emphasizes the sample selection nature of the problem. Part b is
a strong, but standard, form of exogeneity of $x$. We will see that assumption 19.1c is
needed to derive a conditional expectation given the selected sample. It is probably
the most restrictive assumption because it is an explicit distributional assumption.
Assuming $\mathrm{Var}(v_2) = 1$ is without loss of generality because $y_2$ is a binary variable.

Assumption 19.1d requires linearity in the population regression of $u_1$ on $v_2$.
It always holds if $(u_1, v_2)$ is bivariate normal (a standard assumption in these
contexts), but assumption 19.1d holds under weaker assumptions. In particular, we
do not need to assume that $u_1$ itself is normally distributed.

Amemiya (1985) calls equations (19.50) and (19.51) the Type II Tobit model. This
name is fine as a label, but we must understand that it is a model of sample selection,
and it has nothing to do with $y_1$ being a corner solution outcome. Unfortunately, in
almost all treatments of this model, $y_1$ is set to zero when $y_2 = 0$. Setting $y_1$ to zero
(or any value) when $y_2 = 0$ is misleading and can lead to inappropriate use of the
model. For example, it makes no sense to set the wage offer to zero just because we
do not observe it. As another example, it makes no sense to set the price per dollar of
life insurance ($y_1$) to zero for someone who did not buy life insurance (so $y_2 = 1$ if
and only if a person owns a life insurance policy).

We also have some interest in the parameters of the selection equation (19.51); for 
example, in Gronau’s model it is a reduced-form labor force participation equation. 
In program evaluation with attrition, the selection equation explains the probability 
of dropping out of the program. 

Because of the similarities in the statistical structures for the Type II Tobit model 
for sample selection and the exponential Type II Tobit model for corner solutions, 
now is a good time to review the differences in how one applies these models. Recall 
that when using the ET2T for a corner solution response, the goal is to model the 
distribution of the fully observable outcome in a flexible way. The possible outcomes 
include zero—which is a legitimate value and means we need to model the proba- 
bility of a response at zero, as well as the density for positive values. One can debate 
about which model is best, but the important point is that there is no missing data 
problem. (It is for that reason that using the label “selection model” for the hurdle 
model we discussed in Section 17.6.3 is somewhat misleading.) 

The present situation is different from having a corner at zero. In the sample
selection setting, the variable $y_1$ in equation (19.50), which often is in logarithmic form
(in which case the setup appears quite similar to the ET2T model), is not always
observed. We are interested in the mean response in the population, $E(y_1 \mid x_1)$, which
is assumed to follow a standard linear model. Consequently, once the parameters
$\beta_1$ have been consistently estimated, it is trivial to interpret the estimated equation
because the elements of $\beta_1$ directly measure the partial effects of interest. In other
words, in the sample selection context the parameter estimates are directly interpretable
as partial effects on $E(y_1 \mid x_1)$, unlike in the ET2T case, where the various
conditional means of the observed response have no simple forms.

We can allow a little more generality in the model by replacing $x$ in equation
(19.51) with $x_2$; then, as will become clear, $x_1$ would only need to be observed
whenever $y_1$ is, whereas $x_2$ must always be observed. This extension is not especially
useful for something like Gronau's model because it implies that $x_1$ contains elements
that cannot also appear in $x_2$. Because the selection equation is not typically a
structural equation, it is undesirable to impose exclusion restrictions in equation
(19.51). If a variable affecting $y_1$ is observed only along with $y_1$, the instrumental
variables method that we cover in Section 19.6.2 is more attractive.

To derive an estimating equation, let $(y_1, y_2, x, u_1, v_2)$ denote a random draw from
the population. Since $y_1$ is observed only when $y_2 = 1$, what we can hope to estimate
is $E(y_1 \mid x, y_2 = 1)$ [along with $P(y_2 = 1 \mid x)$]. How does $E(y_1 \mid x, y_2 = 1)$ depend on
the vector of interest, $\beta_1$? First, under assumption 19.1 and equation (19.50),

$$E(y_1 \mid x, v_2) = x_1\beta_1 + E(u_1 \mid x, v_2) = x_1\beta_1 + E(u_1 \mid v_2) = x_1\beta_1 + \gamma_1 v_2, \tag{19.52}$$

where the second equality follows because $(u_1, v_2)$ is independent of $x$. Equation
(19.52) is very useful. The first thing to note is that, if $\gamma_1 = 0$ (which implies that $u_1$
and $v_2$ are uncorrelated), then $E(y_1 \mid x, v_2) = E(y_1 \mid x) = E(y_1 \mid x_1) = x_1\beta_1$. Because
$y_2$ is a function of $(x, v_2)$, it follows immediately that $E(y_1 \mid x, y_2) = E(y_1 \mid x_1)$. In
other words, if $\gamma_1 = 0$, then there is no sample selection problem, and $\beta_1$ can be
consistently estimated by OLS using the selected sample.

What if $\gamma_1 \ne 0$? Using iterated expectations on equation (19.52),

$$E(y_1 \mid x, y_2) = x_1\beta_1 + \gamma_1 E(v_2 \mid x, y_2) = x_1\beta_1 + \gamma_1 h(x, y_2),$$

where $h(x, y_2) = E(v_2 \mid x, y_2)$. If we knew $h(x, y_2)$, then, from Theorem 19.1, we
could estimate $\beta_1$ and $\gamma_1$ from the regression $y_1$ on $x_1$ and $h(x, y_2)$, using only the
selected sample. Because the selected sample has $y_2 = 1$, we need only find $h(x, 1)$.
But $h(x, 1) = E(v_2 \mid v_2 > -x\delta_2) = \lambda(x\delta_2)$, where $\lambda(\cdot) = \phi(\cdot)/\Phi(\cdot)$ is the inverse Mills
ratio, and so we can write

$$E(y_1 \mid x, y_2 = 1) = x_1\beta_1 + \gamma_1\lambda(x\delta_2). \tag{19.53}$$
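
The key step behind (19.53) is the truncated-normal mean $E(v_2 \mid v_2 > -c) = \phi(c)/\Phi(c)$. The short sketch below checks that identity by numerical integration and evaluates the right-hand side of (19.53); the function names and the integration settings are illustrative assumptions.

```python
from statistics import NormalDist

STD = NormalDist()  # standard normal distribution

def inverse_mills(c):
    """lambda(c) = phi(c) / Phi(c)."""
    return STD.pdf(c) / STD.cdf(c)

def truncated_mean(c, upper=10.0, n=20000):
    """Midpoint-rule approximation of E(v | v > -c) for v ~ Normal(0, 1)."""
    lower = -c
    h = (upper - lower) / n
    integral = 0.0
    for k in range(n):
        v = lower + (k + 0.5) * h
        integral += v * STD.pdf(v)
    return integral * h / STD.cdf(c)   # P(v > -c) = Phi(c)

def selected_mean(x1_beta1, gamma1, x_delta2):
    """Right-hand side of (19.53) for given index values."""
    return x1_beta1 + gamma1 * inverse_mills(x_delta2)

for c in (-1.0, 0.0, 0.5, 2.0):
    print(f"c = {c:5.1f}  integral = {truncated_mean(c):.6f}  lambda(c) = {inverse_mills(c):.6f}")
```

The two columns should agree to several decimal places. Note that $\lambda(c)$ is large when the selection index is low, so the correction term matters most for units that were unlikely to be selected.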


Equation (19.53), which can be found in numerous places (see, for example, Heckman,
1979, and Amemiya, 1985), makes it clear that an OLS regression of $y_1$ on $x_1$ using
the selected sample omits the term $\lambda(x\delta_2)$ and generally leads to inconsistent
estimation of $\beta_1$. As pointed out by Heckman (1979), the presence of selection bias can be
viewed as an omitted variable problem in the selected sample. An interesting point is
that, even though only $x_1$ appears in the population expectation, $E(y_1 \mid x)$, other
elements of $x$ appear in the expectation on the subpopulation, $E(y_1 \mid x, y_2 = 1)$.

Equation (19.53) also suggests a way to consistently estimate $\beta_1$. Following
Heckman (1979), we can consistently estimate $\beta_1$ and $\gamma_1$ using the selected sample by
regressing $y_{i1}$ on $x_{i1}$, $\lambda(x_i\delta_2)$. The problem is that $\delta_2$ is unknown, so we cannot
compute the additional regressor $\lambda(x_i\delta_2)$. Nevertheless, a consistent estimator of $\delta_2$ is
available from the first-stage probit estimation of the selection equation.

Procedure 19.1: (a) Obtain the probit estimate $\hat{\delta}_2$ from the model

$$P(y_{i2} = 1 \mid x_i) = \Phi(x_i\delta_2) \tag{19.54}$$

using all $N$ observations. Then obtain the estimated inverse Mills ratios $\hat{\lambda}_{i2} = \lambda(x_i\hat{\delta}_2)$
(at least for $i = 1, \ldots, N_1$).
(b) Obtain $\hat{\beta}_1$ and $\hat{\gamma}_1$ from the OLS regression on the selected sample,

$$y_{i1} \text{ on } x_{i1},\ \hat{\lambda}_{i2}, \qquad i = 1, 2, \ldots, N_1. \tag{19.55}$$

These estimators are consistent and $\sqrt{N}$-asymptotically normal.
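
The two steps of Procedure 19.1 can be sketched on simulated data as follows. This is a minimal illustration, not a full implementation: the data-generating values (including $\gamma_1 = 0.8$ and the excluded variable `z`) are hypothetical, the probit is fit with a hand-written Fisher scoring loop, and no corrected standard errors are computed.

```python
import math
import numpy as np

_erf = np.vectorize(math.erf)

def norm_pdf(t):
    return np.exp(-0.5 * t**2) / math.sqrt(2.0 * math.pi)

def norm_cdf(t):
    return 0.5 * (1.0 + _erf(t / math.sqrt(2.0)))

def probit_fit(X, y, max_iter=100, tol=1e-10):
    """Probit MLE by Fisher scoring, starting from zero."""
    delta = np.zeros(X.shape[1])
    for _ in range(max_iter):
        idx = X @ delta
        Phi = np.clip(norm_cdf(idx), 1e-12, 1.0 - 1e-12)
        phi = norm_pdf(idx)
        w = phi / (Phi * (1.0 - Phi))
        score = X.T @ (w * (y - Phi))
        info = (X * (w * phi)[:, None]).T @ X
        step = np.linalg.solve(info, score)
        delta += step
        if np.max(np.abs(step)) < tol:
            break
    return delta

def heckit(y1, X1, X, selected):
    """Procedure 19.1; returns (beta1_hat, gamma1_hat, delta2_hat)."""
    delta2 = probit_fit(X, selected.astype(float))      # step (a): all N observations
    idx = X[selected] @ delta2
    lam = norm_pdf(idx) / norm_cdf(idx)                  # estimated inverse Mills ratios
    R = np.column_stack([X1[selected], lam])             # step (b): selected sample only
    coef = np.linalg.lstsq(R, y1[selected], rcond=None)[0]
    return coef[:-1], coef[-1], delta2

def simulate(n=20000, seed=0):
    """Hypothetical design: gamma1 = 0.8 and one variable (z) excluded from (19.50)."""
    rng = np.random.default_rng(seed)
    x1, z = rng.normal(size=n), rng.normal(size=n)
    v2 = rng.normal(size=n)
    u1 = 0.8 * v2 + rng.normal(scale=0.6, size=n)        # E(u1 | v2) = 0.8 * v2
    X1 = np.column_stack([np.ones(n), x1])
    X = np.column_stack([np.ones(n), x1, z])
    selected = X @ np.array([0.2, 0.5, 1.0]) + v2 > 0    # selection rule as in (19.51)
    y1 = X1 @ np.array([1.0, 2.0]) + u1                  # outcome as in (19.50)
    y1[~selected] = np.nan                               # unobserved outside the selected sample
    return y1, X1, X, selected

y1, X1, X, selected = simulate()
beta1_hat, gamma1_hat, delta2_hat = heckit(y1, X1, X, selected)
ols = np.linalg.lstsq(X1[selected], y1[selected], rcond=None)[0]
print("Heckit beta1:", beta1_hat, "gamma1:", gamma1_hat)
print("OLS on selected sample:", ols)
```

In this design the OLS intercept on the selected sample should be visibly too large, while the two-step estimates should be close to the assumed values (1, 2) and 0.8. Standard errors from the second-step regression are not valid when $\gamma_1 \ne 0$, as discussed in the text.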


The procedure is sometimes called Heckit after Heckman (1976) and the tradition of 
putting “it” on the end of procedures related to probit (such as Tobit). 

A very simple test for selection bias is available from regression (19.55). Under
the null of no selection bias, $H_0\colon \gamma_1 = 0$, we have $\mathrm{Var}(y_1 \mid x, y_2 = 1) = \mathrm{Var}(y_1 \mid x) =
\mathrm{Var}(u_1)$, and so homoskedasticity holds under $H_0$. Further, from the results on
generated regressors in Chapter 6, the asymptotic variance of $\hat{\gamma}_1$ (and $\hat{\beta}_1$) is not affected
by $\hat{\delta}_2$ when $\gamma_1 = 0$. Thus, a standard $t$ test on $\hat{\gamma}_1$ is a valid test of the null hypothesis
of no selection bias.

When $\gamma_1 \ne 0$, obtaining a consistent estimate for the asymptotic variance of $\hat{\beta}_1$ is
complicated for two reasons. The first is that, if $\gamma_1 \ne 0$, then $\mathrm{Var}(y_1 \mid x, y_2 = 1)$ is not
constant. As we know, heteroskedasticity itself is easy to correct for using the robust
standard errors. However, we should also account for the fact that $\hat{\delta}_2$ is an estimator
of $\delta_2$. The adjustment to the variance of $(\hat{\beta}_1, \hat{\gamma}_1)$ because of the two-step estimation is
cumbersome: it is not enough to simply make the standard errors
heteroskedasticity-robust. Some statistical packages now have this feature built in.

As a technical point, we do not need $x_1$ to be a strict subset of $x$ for $\beta_1$ to be
identified, and procedure 19.1 does carry through when $x_1 = x$. However, if $x_i\hat{\delta}_2$ does
not have much variation in the sample, then $\hat{\lambda}_{i2}$ can be approximated well by a linear
function of $x$. If $x = x_1$, this correlation can introduce severe collinearity among the
regressors in regression (19.55), which can lead to large standard errors of the
elements of $\hat{\beta}_1$. When $x_1 = x$, $\beta_1$ is identified only due to the nonlinearity of the inverse
Mills ratio.
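
The collinearity concern can be made concrete by asking how well a straight line approximates $\lambda(\cdot)$ over the range of the selection index. The sketch below regresses $\lambda(c)$ on a constant and $c$ over an evenly spaced grid; the two ranges used are arbitrary illustrative choices.

```python
from statistics import NormalDist, mean

STD = NormalDist()

def inverse_mills(c):
    """lambda(c) = phi(c) / Phi(c)."""
    return STD.pdf(c) / STD.cdf(c)

def r_squared_on_index(lower, upper, n=201):
    """R-squared from regressing lambda(c) on a constant and c over a grid."""
    grid = [lower + (upper - lower) * k / (n - 1) for k in range(n)]
    lam = [inverse_mills(c) for c in grid]
    c_bar, lam_bar = mean(grid), mean(lam)
    s_cc = sum((c - c_bar) ** 2 for c in grid)
    s_cl = sum((c - c_bar) * (l - lam_bar) for c, l in zip(grid, lam))
    s_ll = sum((l - lam_bar) ** 2 for l in lam)
    return s_cl ** 2 / (s_cc * s_ll)   # squared correlation = simple-regression R-squared

narrow = r_squared_on_index(-1.0, 1.0)
wide = r_squared_on_index(-3.0, 3.0)
print(f"R-squared, index in [-1, 1]: {narrow:.3f}")
print(f"R-squared, index in [-3, 3]: {wide:.3f}")
```

When the index varies over a narrow range, the linear fit is very close, so adding the inverse Mills ratio to a regression that already contains the same variables linearly adds little independent variation. A wider range of the index exposes more of the curvature.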

The situation is not quite as bad as in Section 9.5.1. There, identification failed for
certain values of the structural parameters. Here, we still have identification for any
value of $\beta_1$ in equation (19.50), but it is unlikely we can estimate $\beta_1$ with much
precision. Even if we can, we would have to wonder whether a statistically significant
inverse Mills ratio term is due to sample selection or functional form misspecification
in the population model (19.50).

Table 19.1
Wage Offer Equation for Married Women

Dependent Variable: log(wage)

Independent Variable      OLS           Heckit
educ                      .108          .109
                          (.014)        (.016)
exper                     .042          .044
                          (.012)        (.016)
exper²                    -.00081       -.00086
                          (.00039)      (.00044)
constant                  -.522         -.578
                          (.199)        (.307)
$\hat{\lambda}_2$                --            .032
                                        (.134)
Sample size               428           428
R-squared                 .157          .157


Example 19.6 (Wage Offer Equation for Married Women): We use the data in
MROZ.RAW to estimate a wage offer function for married women, accounting for
potential selectivity bias into the workforce. Of the 753 women, we observe the wage
offer for 428 working women. The labor force participation equation contains the
variables in Table 15.1, including other income, age, number of young children, and
number of older children, in addition to educ, exper, and exper². The results of OLS
on the selected sample and the Heckit method are given in Table 19.1.

The differences between the OLS and Heckit estimates are practically small, and
the inverse Mills ratio term is statistically insignificant. The fact that the intercept
estimates differ somewhat is usually unimportant. (The standard errors reported for
Heckit are the unadjusted ones from regression (19.55). If $\hat{\lambda}_2$ were statistically
significant, we should obtain the corrected standard errors.)

The Heckit results in Table 19.1 use four exclusion restrictions in the structural
equation, because nwifeinc, age, kidslt6, and kidsge6 are all excluded from the wage
offer equation. If we allow all variables in the selection equation to also appear in the
wage offer equation, the Heckit estimates become very imprecise. The coefficient on
educ becomes .119 (se = .034), compared with the OLS estimate .100 (se = .015). The
coefficient on kidslt6, which now appears in the wage offer equation, is -.188
(se = .232) in the Heckit estimation, and -.056 (se = .009) in the OLS estimation.
The imprecision of the Heckit estimates is due to the severe collinearity that comes
from adding $\hat{\lambda}_2$ to the equation, because $\hat{\lambda}_2$ is now a function only of the explanatory
variables in the wage offer equation. In fact, using the selected sample, regressing $\hat{\lambda}_2$ on
the seven explanatory variables gives R-squared = .962. Unfortunately, comparing
the OLS and Heckit results does not allow us to resolve some important issues. For
example, the OLS results suggest that another young child reduces the wage offer by
about 5.6 percent ($t$ statistic ≈ -6.2), other things being equal. Is this effect real, or is
it simply due to our inability to adequately correct for sample selection bias? Unless
we have a variable that affects labor force participation without affecting the wage
offer, we cannot answer this question.
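
The $t$ statistics implied by the estimates quoted in this example can be reproduced with a few lines of arithmetic. The sketch below uses only the coefficients and standard errors reported above; the p-values rely on a standard normal approximation, which is an assumption made here for simplicity, and the Heckit standard errors are the unadjusted ones.

```python
from statistics import NormalDist

def t_stat(coef, se):
    """Ratio of an estimate to its standard error."""
    return coef / se

def two_sided_p(t):
    """Two-sided p-value from the standard normal approximation."""
    return 2.0 * (1.0 - NormalDist().cdf(abs(t)))

# Estimates and standard errors quoted in Example 19.6.
reported = {
    "IMR, Heckit (Table 19.1)": (0.032, 0.134),
    "kidslt6, OLS": (-0.056, 0.009),
    "kidslt6, Heckit": (-0.188, 0.232),
}
for label, (coef, se) in reported.items():
    t = t_stat(coef, se)
    print(f"{label:26s} t = {t:6.2f}  p = {two_sided_p(t):.3f}")
```

The inverse Mills ratio term has a $t$ statistic of roughly 0.24, while the kidslt6 coefficient goes from a $t$ statistic near -6.2 under OLS to well below one in absolute value under Heckit, which is the loss of precision described in the text.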


If we replace parts c and d in assumption 19.1 with the stronger assumption that
$(u_1, v_2)$ is bivariate normal with mean zero, $\mathrm{Var}(u_1) = \sigma_1^2$, $\mathrm{Cov}(u_1, v_2) = \sigma_{12}$, and
$\mathrm{Var}(v_2) = 1$, then partial maximum likelihood estimation can be used, as described
generally in problem 13.7. Partial MLE will be more efficient than the two-step
procedure under joint normality of $u_1$ and $v_2$, and it will produce standard errors and
likelihood ratio statistics that can be used directly (this conclusion follows from
problem 13.7). The drawbacks are that it is less robust than the two-step procedure
and that it is sometimes difficult to get the problem to converge.

The reason we cannot perform full conditional MLE (contrast the situation
in Section 17.6.3 for the ET2T model) is that $y_1$ is only observed when $y_2 = 1$.
Thus, while we can use the full density of $y_2$ given $x$, which is $f(y_2 \mid x) =
[\Phi(x\delta_2)]^{y_2}[1 - \Phi(x\delta_2)]^{1 - y_2}$, $y_2 = 0, 1$, we can only use the density $f(y_1 \mid y_2, x)$ when
$y_2 = 1$. To find $f(y_1 \mid y_2, x)$ at $y_2 = 1$, we can use Bayes' rule to write $f(y_1 \mid y_2, x) =
f(y_2 \mid y_1, x) f(y_1 \mid x)/f(y_2 \mid x)$. Therefore, $f(y_1 \mid y_2 = 1, x) = P(y_2 = 1 \mid y_1, x) f(y_1 \mid x)/
P(y_2 = 1 \mid x)$. But $y_1 \mid x \sim \mathrm{Normal}(x_1\beta_1, \sigma_1^2)$. Further, $y_2 = 1[x\delta_2 + \sigma_{12}\sigma_1^{-2}(y_1 - x_1\beta_1)
+ e_2 > 0]$, where $e_2$ is independent of $(x, y_1)$ and $e_2 \sim \mathrm{Normal}(0, 1 - \sigma_{12}^2\sigma_1^{-2})$ (this
conclusion follows from standard conditional distribution results for joint normal
random variables). Therefore,

$$P(y_2 = 1 \mid y_1, x) = \Phi\{[x\delta_2 + \sigma_{12}\sigma_1^{-2}(y_1 - x_1\beta_1)](1 - \sigma_{12}^2\sigma_1^{-2})^{-1/2}\}.$$

Combining all of these pieces [and noting the cancellation of $P(y_2 = 1 \mid x)$] we get

$$\ell_i(\theta) = (1 - y_{i2}) \log[1 - \Phi(x_i\delta_2)] + y_{i2}\Big(\log \Phi\{[x_i\delta_2 + \sigma_{12}\sigma_1^{-2}(y_{i1} - x_{i1}\beta_1)](1 - \sigma_{12}^2\sigma_1^{-2})^{-1/2}\} + \log \phi[(y_{i1} - x_{i1}\beta_1)/\sigma_1] - \log(\sigma_1)\Big).$$

The partial log likelihood is obtained by summing $\ell_i(\theta)$ across all observations;
$y_{i2} = 1$ picks out when $y_{i1}$ is observed and therefore contains information for
estimating $\beta_1$.
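
A direct transcription of the partial log-likelihood contribution for one observation might look like the sketch below. The argument names are hypothetical, and the linear indexes $x_i\delta_2$ and $x_{i1}\beta_1$ are passed in as precomputed numbers to keep the example short.

```python
import math
from statistics import NormalDist

STD = NormalDist()

def log_phi(t):
    """Log of the standard normal density."""
    return -0.5 * t * t - 0.5 * math.log(2.0 * math.pi)

def partial_loglik_i(y2, x_delta2, y1=None, x1_beta1=None, sigma1=1.0, sigma12=0.0):
    """Partial log-likelihood contribution for one observation.

    y2       : selection indicator (0 or 1)
    x_delta2 : selection index x_i * delta_2
    y1       : outcome, used only when y2 == 1
    x1_beta1 : outcome index x_i1 * beta_1, used only when y2 == 1
    """
    if abs(sigma12) >= sigma1:
        raise ValueError("need |sigma12| < sigma1 so that Var(e2) > 0")
    if y2 == 0:
        return math.log(1.0 - STD.cdf(x_delta2))
    resid = y1 - x1_beta1
    index = x_delta2 + sigma12 / sigma1**2 * resid
    scale = math.sqrt(1.0 - sigma12**2 / sigma1**2)
    return math.log(STD.cdf(index / scale)) + log_phi(resid / sigma1) - math.log(sigma1)
```

With `sigma12 = 0` the contribution for a selected observation splits into a probit term plus a normal regression term, which is the no-selection-bias case. Summing `partial_loglik_i` over the sample and maximizing over the parameters numerically would give the partial MLE.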

Ahn and Powell (1993) show how to consistently estimate $\beta_1$ without making any
distributional assumptions; in particular, the selection equation need not have the
probit form. Vella (1998) contains a useful survey.


19.6.2 Endogenous Explanatory Variables


We now study the sample selection model when one of the elements of $x_1$ is thought
to be correlated with $u_1$. Or, all the elements of $x_1$ are exogenous in the population
model but data are missing on an element of $x_1$, and the reason data are missing
might be systematically related to $u_1$. For simplicity, we focus on the case of a single
endogenous explanatory variable. Having multiple endogenous explanatory variables
adds no complications beyond the usual identification conditions.

The model in the population is

$$y_1 = z_1\delta_1 + \alpha_1 y_2 + u_1 \tag{19.56}$$

$$y_2 = z_2\delta_2 + v_2 \tag{19.57}$$

$$y_3 = 1[z\delta_3 + v_3 > 0]. \tag{19.58}$$

The first equation is the structural equation of interest, the second equation is a
linear projection for the potentially endogenous or missing variable $y_2$, and the third
equation is the selection equation. We allow arbitrary correlation among $u_1$, $v_2$,
and $v_3$.

The setup in equations (19.56)-(19.58) encompasses at least three cases of interest.
The first occurs when $y_2$ is always observed but is endogenous in equation (19.56).
An example is seen when $y_1$ is log(wage$^o$) and $y_2$ is years of schooling: years of
schooling is generally available whether or not someone is in the workforce. The
model also applies when $y_2$ is observed only along with $y_1$, as would happen if $y_1$ =
log(wage$^o$) and $y_2$ is the ratio of the benefits offer to wage offer. As a second example,
let $y_1$ be the percentage of voters supporting the incumbent in a congressional
district, and let $y_2$ be intended campaign expenditures. Then $y_3 = 1$ if the incumbent
runs for reelection, and we only observe $(y_1, y_2)$ when $y_3 = 1$. A third application is
to missing data only on $y_2$, as in example 19.4 where $y_2$ is IQ score. In the last two
cases, $y_2$ might in fact be exogenous in equation (19.56), but endogenous sample
selection effectively makes $y_2$ endogenous in the selected sample.

If $y_1$ and $y_2$ were always observed along with $z$, we would just estimate equation
(19.56) by 2SLS if $y_2$ is endogenous. We can use the results from Section 19.4.1 to
show that 2SLS with the inverse Mills ratio added to the regressors is consistent.
Regardless of the data availability on $y_1$ and $y_2$, in the second step we use only
observations for which both $y_1$ and $y_2$ are observed.


ASSUMPTION 19.2: (a) $(z, y_3)$ is always observed, $(y_1, y_2)$ is observed when $y_3 = 1$;
(b) $(u_1, v_3)$ is independent of $z$; (c) $v_3 \sim \mathrm{Normal}(0, 1)$; (d) $E(u_1 \mid v_3) = \gamma_1 v_3$; and (e)
$E(z_2' v_2) = 0$ and, writing $z_2\delta_2 = z_1\delta_{21} + z_{21}\delta_{22}$, $\delta_{22} \ne 0$.

Parts b, c, and d are identical to the corresponding assumptions in assumption 19.1
when all explanatory variables are observed and exogenous. Assumption e is new,
resulting from the endogeneity of $y_2$ in equation (19.56). It is important to see that
assumption 19.2e is identical to the rank condition needed for identifying equation
(19.56) in the absence of sample selection.

The relationship between $z_2$ and $z$ is somewhat subtle. The assumptions allow the
possibility that $z_2 = z$, but it is helpful to think of $z$ including at least one factor that
"primarily" affects selection that is not also in $z_2$. This way, we can think of needing
at least one instrument for $y_2$ (as in the case without selection) and then at least
one more exogenous variable that affects selection. This thinking forces discipline on
us when we (potentially) have the problem of an endogenous explanatory variable
and sample selection.

One can choose $z_2 = z$, but then we should have at least two elements in $z$ that are
not in $z_1$. As we will see, choosing $z_2 = z$ means that the (implicit) reduced form for
$y_2$ will tend to suffer from collinearity because then the inverse Mills ratio will be
a function of the same variables, $z$, appearing linearly as regressors. Unless we are
interested in the reduced form parameters, the collinearity introduced by having the
same regressors appearing linearly as those appearing in the inverse Mills ratio is not
of much concern (remember, 2SLS uses only the fitted values from the reduced
form), but it is possible the small-sample performance of the procedure is affected.

Importantly, if we choose $z_2$ to be a strict subset of $z$, we are not making exclusion
restrictions in the reduced form; we are simply choosing between variables that are
viewed as instruments for $y_2$ versus variables that affect selection. By contrast, we
usually do not want to make exclusion restrictions in the selection equation, as the
subsequent procedure is inconsistent if those restrictions are violated. Therefore, it is
a good idea to choose $z$ to be the vector of all exogenous variables.

To derive an estimating equation, write (in the population)

$$y_1 = z_1\delta_1 + \alpha_1 y_2 + g(z, y_3) + e_1, \tag{19.59}$$

where $g(z, y_3) = E(u_1 \mid z, y_3)$ and $e_1 = u_1 - E(u_1 \mid z, y_3)$. By definition, $E(e_1 \mid z, y_3) =
0$. If we knew $g(z, y_3)$ then, from Theorem 19.1, we could just estimate equation
(19.59) by 2SLS on the selected sample ($y_3 = 1$) using instruments $[z, g(z, 1)]$. It turns
out that we do know $g(z, 1)$ up to some estimable parameters: $E(u_1 \mid z, y_3 = 1) =
\gamma_1\lambda(z\delta_3)$. Since $\delta_3$ can be consistently estimated by probit of $y_3$ on $z$ (using the entire
sample), we have the following:


Procedure 19.2: (a) Obtain $\hat{\delta}_3$ from probit of $y_3$ on $z$ using all observations. Obtain
the estimated inverse Mills ratios, $\hat{\lambda}_{i3} = \lambda(z_i\hat{\delta}_3)$.
(b) Using the selected subsample (for which we observe both $y_1$ and $y_2$), estimate
the equation

$$y_{i1} = z_{i1}\delta_1 + \alpha_1 y_{i2} + \gamma_1\hat{\lambda}_{i3} + \mathrm{error}_i \tag{19.60}$$

by 2SLS, using instruments $(z_{i2}, \hat{\lambda}_{i3})$.
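
The two steps of Procedure 19.2 can be sketched on simulated data as follows. All names and parameter values are hypothetical: `q` plays the role of an instrument for $y_2$, `r` is a variable that affects only selection, the assumed true values are $\alpha_1 = 0.5$ and $\gamma_1 = 0.6$, and the probit is fit with a hand-written Fisher scoring loop. No corrected standard errors are computed.

```python
import math
import numpy as np

_erf = np.vectorize(math.erf)

def norm_pdf(t):
    return np.exp(-0.5 * t**2) / math.sqrt(2.0 * math.pi)

def norm_cdf(t):
    return 0.5 * (1.0 + _erf(t / math.sqrt(2.0)))

def probit_fit(X, y, max_iter=100, tol=1e-10):
    """Probit MLE by Fisher scoring, starting from zero."""
    delta = np.zeros(X.shape[1])
    for _ in range(max_iter):
        idx = X @ delta
        Phi = np.clip(norm_cdf(idx), 1e-12, 1.0 - 1e-12)
        phi = norm_pdf(idx)
        w = phi / (Phi * (1.0 - Phi))
        step = np.linalg.solve((X * (w * phi)[:, None]).T @ X, X.T @ (w * (y - Phi)))
        delta += step
        if np.max(np.abs(step)) < tol:
            break
    return delta

def tsls(y, R, W):
    """2SLS coefficients of y on regressors R using instrument matrix W."""
    R_hat = W @ np.linalg.lstsq(W, R, rcond=None)[0]     # first-stage fitted values
    return np.linalg.lstsq(R_hat, y, rcond=None)[0]

def procedure_19_2(y1, y2, Z1, Z2, Z, selected):
    """Returns (delta1_hat, alpha1_hat, gamma1_hat)."""
    delta3 = probit_fit(Z, selected.astype(float))       # step (a): all observations
    idx = Z[selected] @ delta3
    lam = norm_pdf(idx) / norm_cdf(idx)
    R = np.column_stack([Z1[selected], y2[selected], lam])   # regressors in (19.60)
    W = np.column_stack([Z2[selected], lam])                  # instruments (z_i2, lambda_i3)
    coef = tsls(y1[selected], R, W)                           # step (b): selected sample
    k = Z1.shape[1]
    return coef[:k], coef[k], coef[k + 1]

def simulate(n=20000, seed=0):
    """Hypothetical design with alpha1 = 0.5, gamma1 = 0.6, and endogenous y2."""
    rng = np.random.default_rng(seed)
    w1, q, r = rng.normal(size=(3, n))
    v2, v3 = rng.normal(size=(2, n))
    u1 = 0.6 * v3 + 0.5 * v2 + rng.normal(scale=0.5, size=n)
    one = np.ones(n)
    Z1 = np.column_stack([one, w1])
    Z2 = np.column_stack([one, w1, q])          # z2 = (z1, q): q instruments for y2
    Z = np.column_stack([one, w1, q, r])        # r affects selection only
    y2 = Z2 @ np.array([0.5, 0.5, 1.0]) + v2
    y1 = Z1 @ np.array([1.0, 1.0]) + 0.5 * y2 + u1
    selected = Z @ np.array([0.2, 0.3, 0.3, 1.0]) + v3 > 0
    return y1, y2, Z1, Z2, Z, selected

data = simulate()
delta1_hat, alpha1_hat, gamma1_hat = procedure_19_2(*data)
print("delta1:", delta1_hat, "alpha1:", alpha1_hat, "gamma1:", gamma1_hat)
```

Note that the actual values of $y_{i2}$ enter the second step as a regressor; fitted values are formed only inside the 2SLS routine, with the estimated inverse Mills ratio included among both the regressors and the instruments.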

The steps in this procedure show that identification actually requires that $z_{21}$
appear in the linear projection of $y_2$ onto $z_1$, $z_{21}$, and $\lambda(z\delta_3)$ in the selected
subpopulation. It would be unusual if this condition were not true when the rank
condition 19.2e holds in the population.

The hypothesis-of-no-selection problem (allowing $y_2$ to be endogenous or not),
$H_0\colon \gamma_1 = 0$, is tested using the usual 2SLS $t$ statistic for $\hat{\gamma}_1$. When $\gamma_1 \ne 0$, standard
errors and test statistics should be corrected for the generated regressors problem, as
in Chapter 6, or using the bootstrap (with subsamples drawn from the complete list
of observations).


Example 19.7 (Education Endogenous and Sample Selection): In example 19.6 we
now allow educ to be endogenous in the wage offer equation, and we test for sample
selection bias. Just as if we did not have a sample selection problem, we need IVs
for educ that do not appear in the wage offer equation. As in example 5.3, we use
parents' education (motheduc, fatheduc) and husband's education as IVs. In addition,
we need some variables that affect labor force participation but not the wage offer;
we use the same four variables as in example 19.6. Therefore, all variables except
educ (and, of course, the wage offer) are treated as exogenous.

We include all of the exogenous variables in the labor force participation equation:
exper, exper², nwifeinc, age, kidslt6, kidsge6, motheduc, fatheduc, and huseduc (not
educ). (As it turns out, the three education variables are marginally jointly significant,
with p-value = .046.) If we add the inverse Mills ratio, $\hat{\lambda}_3$, to the equation and
estimate it by 2SLS, using all of the exogenous variables as instruments (in addition
to $\hat{\lambda}_3$), the coefficient on $\hat{\lambda}_3$ is .040 (se = .133). Therefore, there is little evidence
of sample selection bias. Further, the estimated coefficient on education is .088
(se = .021), which is very close to the estimate when we drop the IMR: the 2SLS
coefficient without the IMR included is .087 (se = .021). Thus, there is no practical
consequence of correcting for selection bias.

If we use only the parents' and husband's education variables as instruments for
educ, the coefficient on $\hat{\lambda}_3$ becomes .036 (se = .134), and so the evidence for selection
is still nonexistent. The education coefficient changes somewhat, to .081 (se = .022),
but this is due to the different list of instruments, not the selection correction. Without
$\hat{\lambda}_3$ in the equation, the 2SLS estimate using the smaller list of instruments is .080
(se = .022). Therefore, in this data set, the estimated return to education changes
depending on whether education is allowed to be endogenous, and then somewhat
depending on the instruments used for it. But whether a sample selection correction is
used has essentially no effect.


Importantly, procedure 19.2 applies to any kind of endogenous variable $y_2$,
including binary and other discrete variables, without any additional assumptions.
Why? Because the reduced form for $y_2$ is just a linear projection; we do not have to
assume, for example, that $v_2$ is normally distributed or even independent of $z$. As an
example, we might wish to look at the effects of participation in a job training program
on the subsequent wage offer, accounting for the fact that not all who participated
in the program will be employed in the following period ($y_2$ is always observed
in this case). If participation is voluntary, an instrument for it might be whether the
person was randomly chosen as a potential participant.

In order for the selection correction to work for $y_2$ with a variety of distributions,
it is important to carry out the estimation exactly as described in procedure 19.2. It is
tempting to first obtain fitted values, say $\hat{y}_{i2}$, insert these for $y_{i2}$, and then apply the
Heckman regression. (If $y_2$ is always observed, one would probably just estimate a
linear reduced form using all the observations; if $y_2$ is missing along with $y_1$, one
would probably use a standard Heckman correction on the reduced form.) The
problem with using fitted values in place of $y_2$, and then, say, running the regression
$y_{i1}$ on $z_{i1}$, $\hat{y}_{i2}$, $\hat{\lambda}_{i3}$ on the selected sample, is that this procedure places strong
assumptions on the reduced form error, $v_2$ in equation (19.57). In effect, it adds $\alpha_1 v_2$
to the structural error, $u_1$, with the result that we would have to assume, at a minimum,
that $u_1 + \alpha_1 v_2$ is independent of $z$ and that $E(u_1 + \alpha_1 v_2 \mid v_3)$ is linear. Such
assumptions are virtually impossible unless $y_2$ is continuously distributed. Of course,
joint normality of $(u_1, v_2, v_3)$ is sufficient, but this approach is much more restrictive
than assumption 19.2. Thus, applying 2SLS to equation (19.60) is much preferred to
first replacing $y_2$ with fitted values.

Even if $y_2$ is exogenous in the population equation (19.56), when $y_2$ is sometimes
missing we generally need an instrument for $y_2$ when selection is not ignorable (that
is, $E(u_1 \mid z_1, y_2, y_3) \ne E(u_1)$). In example 19.4 we could use family background variables
and another test score, such as KWW, as IVs for IQ, assuming these are always
observed. We would generally include all such variables in the reduced-form selection
equation. Procedure 19.2 works whether we assume IQ is a proxy variable for ability
or an indicator of ability (see Chapters 4 and 5).

To illustrate the likely consequences of using the same variable as an instrument
for $y_2$ and to predict selection, reconsider example 19.7. As we already know, the
number of young children predicts a lower probability of labor force participation. In
this data set, education has a positive and statistically significant partial correlation
with kidslt6. Therefore, we might try applying procedure 19.2 with kidslt6 the only
element in $z_2 = z$ that is not in $z_1$. That is, the probit for inlf includes exper, exper²,
and kidslt6, and then the instrumental variables are exper, exper², $\hat{\lambda}_3$, and kidslt6.
Using this approach, the estimated education coefficient is .329 (se = .188), which is
much too large to be plausible. Further, the 95% confidence interval (which ignores
the two-step estimation) includes zero, showing the effects of the severe collinearity in
the IV estimation.

If we make stronger assumptions, it is possible to estimate model (19.56)-(19.58)
by partial maximum likelihood of the kind discussed in problem 13.7. One possibility
is to assume that $(u_1, v_2, v_3)$ is trivariate normal and independent of $z$. In addition to
ruling out discrete $y_2$, such a procedure would be computationally difficult. If $y_2$ is
binary, we can model it as $y_2 = 1[z_2\delta_2 + v_2 > 0]$, where $v_2 \mid z \sim \mathrm{Normal}(0, 1)$. But
maximum likelihood estimation that allows any correlation matrix for $(u_1, v_2, v_3)$ is
complicated and less robust than procedure 19.2.


19.6.3 Binary Response Model with Sample Selection 


We can estimate binary response models with sample selection if we assume that 
the latent errors are bivariate normal and independent of the explanatory variables. 
Write the model as 


$$y_1 = 1[x_1\beta_1 + u_1 > 0] \tag{19.61}$$
$$y_2 = 1[x\delta_2 + v_2 > 0], \tag{19.62}$$


where the second equation is the sample selection equation and $y_1$ is observed only
when $y_2 = 1$; we assume that $x$ is always observed. For example, suppose $y_1$ is an
employment indicator and $x_1$ contains a job training binary indicator (which we
assume is exogenous), as well as other human capital and family background variables.
We might lose track of some people who are eligible to participate in the program;
this is an example of sample attrition. If attrition is systematically related to $u_1$,
estimating equation (19.61) on the sample at hand can result in an inconsistent
estimator of $\beta_1$.

If we assume that $(u_1, v_2)$ is independent of $x$ with a zero-mean normal distribution
(and unit variances), we can apply partial maximum likelihood. What we need is the
density of $y_1$ conditional on $x$ and $y_2 = 1$. We have essentially found this density in
Chapter 15: in equation (15.55) set $\alpha_1 = 0$, replace $z$ with $x$, and replace $\delta_1$ with $\beta_1$.
The parameter $\rho_1$ is still the correlation between $u_1$ and $v_2$. A two-step procedure can
be applied: first, estimate $\delta_2$ by probit of $y_2$ on $x$. Then, estimate $\beta_1$ and $\rho_1$ in the
second stage using equation (15.55) along with $P(y_1 = 0 \mid x, y_2 = 1)$.

A convincing analysis requires at least one variable in $x$ (that is, something that
determines selection) that is not also in $x_1$. Otherwise, identification is off of the
nonlinearities in the probit models.

Importantly, some simple strategies for "correcting" for sample selection are not
valid. For example, on one hand, it is tempting to estimate the selection equation by
probit and then plug the estimated inverse Mills ratio into the second-stage probit,
using only the observations with $y_2 = 1$. There is no way to justify this as a sample
selection correction. On the other hand, inserting the IMR into the second-stage
probit is a legitimate test of the null hypothesis of no selection bias. One can start
from equation (15.55) with $\alpha_1 = 0$, $z = x$, $z_1 = x_1$, and $\delta_1 = \beta_1$, which is
$E(y_1 \mid x, y_2 = 1)$. Fixing $\delta_2$ (because we subsequently insert its probit estimate), we can
compute the gradient of the mean function with respect to $(\beta_1', \rho_1)'$, and then insert
$\rho_1 = 0$ (which holds under the null). If $m(x, \beta_1, \rho_1; \delta_2)$ denotes the mean function,
then it can be shown that $\nabla_\beta m(x, \beta_1, 0; \delta_2) = \phi(x_1\beta_1)x_1$ and
$\nabla_\rho m(x, \beta_1, 0; \delta_2) = \phi(x_1\beta_1)\lambda(x\delta_2)$, where $\lambda(\cdot)$ is the IMR. If we use these
gradients in the score tests developed in Section 15.5.3, we obtain a simple score
statistic for sample selection. Actually, the variable addition version of the test is
simpler. In the first step, estimate $\delta_2$ by probit of $y_{i2}$ on $x_i$, using all of the
observations. (Under the null, $\hat{\delta}_2$ is the MLE because there is no sample selection.)
Construct the IMRs, $\hat{\lambda}_{i2} = \lambda(x_i\hat{\delta}_2)$. Next, using the observations for which
$y_{i2} = 1$ (that is, for which $y_{i1}$ is observed), run probit of $y_{i1}$ on $x_{i1}$, $\hat{\lambda}_{i2}$ and use the
usual $t$ statistic on $\hat{\lambda}_{i2}$ to test the null hypothesis $H_0\colon \rho_1 = 0$. Because the
coefficient on $\hat{\lambda}_{i2}$ is zero under the null, there is no need to adjust the $t$ statistic for
the first-stage estimation; see Section 12.4.2.
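
The variable addition test just described can be sketched in a few lines. The following is only an illustration under stated assumptions: the data are simulated, the function names (`inverse_mills`, `probit_mle`, `selection_test`) and the design parameters are hypothetical, and the probit is fit with a generic optimizer rather than any particular econometrics package.

```python
import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm

def inverse_mills(c):
    """lambda(c) = phi(c) / Phi(c), computed on the log scale for stability."""
    return np.exp(norm.logpdf(c) - norm.logcdf(c))

def probit_mle(y, X):
    """Probit MLE; returns coefficients and usual (inverse-Hessian) standard errors."""
    q = 2.0 * y - 1.0                      # +1 when y = 1, -1 when y = 0
    n = len(y)

    def neg_avg_loglik(b):
        return -norm.logcdf(q * (X @ b)).sum() / n

    def grad(b):
        lam = inverse_mills(q * (X @ b))
        return -(X * (q * lam)[:, None]).sum(axis=0) / n

    b = minimize(neg_avg_loglik, np.zeros(X.shape[1]), jac=grad, method="BFGS").x
    c = q * (X @ b)
    lam = inverse_mills(c)
    hess = (X * (lam * (lam + c))[:, None]).T @ X   # minus the log-likelihood Hessian
    return b, np.sqrt(np.diag(np.linalg.inv(hess)))

def selection_test(y1, y2, X1, X):
    """Variable-addition test of H0: rho_1 = 0; returns the t statistic on the IMR."""
    delta_hat, _ = probit_mle(y2, X)               # step 1: all observations
    imr = inverse_mills(X @ delta_hat)
    sel = y2 == 1
    W = np.column_stack([X1[sel], imr[sel]])       # step 2: selected sample only
    b, se = probit_mle(y1[sel], W)
    return b[-1] / se[-1]

# Simulated data (hypothetical design): corr(u1, v2) = 0.8, so H0 is false.
rng = np.random.default_rng(0)
N = 20_000
x1, x2 = rng.normal(size=N), rng.normal(size=N)
v2 = rng.normal(size=N)
u1 = 0.8 * v2 + 0.6 * rng.normal(size=N)           # unit variance, correlation 0.8
X1 = np.column_stack([np.ones(N), x1])
X = np.column_stack([X1, x2])                      # x2 affects selection only
y2 = (0.3 + 0.5 * x1 + 1.0 * x2 + v2 > 0).astype(float)
y1 = (0.2 + 0.8 * x1 + u1 > 0).astype(float)       # used only where y2 == 1
t_stat = selection_test(y1, y2, X1, X)
print(f"t statistic on the inverse Mills ratio: {t_stat:.2f}")
```

In line with the text, the second-stage coefficients are used only for testing; the sketch makes no attempt to treat them as selection-corrected estimates.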

Allowing for endogenous explanatory variables in equation (19.61) along with 
sample selection is difficult, and it is a useful area of future research. 


19.6.4 An Exponential Response Function 


Another nonlinear model for which it is easy to obtain a simple test for selection bias,
and relatively easy to correct for sample selection bias, is the exponential response
model. The mechanics are very similar to those in Section 18.5 for an exponential
response function and a binary endogenous explanatory variable. In particular, we
still use equation (18.48) but only for the subsample with $y_2 = 1$ (the selected sample).
The two-step method proposed by Terza (1998), which consists of estimating a
probit in the first stage followed by nonlinear regression (or a quasi-MLE, such as the
Poisson or exponential), consistently estimates the parameters. A simple test of sample
selection bias is obtained by adding the log of the inverse Mills ratio, $\log[\lambda(z_i\hat{\delta}_2)]$,
to the exponential function, and estimating the resulting "model" by, say, the Poisson
QMLE using the selected sample. The robust $t$ statistic for $\log[\lambda(z_i\hat{\delta}_2)]$ that allows the
likelihood to be misspecified is a valid test of the null hypothesis of no selection bias.


19.7 Incidental Truncation: A Tobit Selection Equation


We now study the case where more information is available on sample selection, 
primarily in the context of incidental truncation. In particular, we assume that selec- 
tion is based on the outcome of a Tobit, rather than a probit, equation. The model in 
Section 19.7.1 is a special case of the model studied by Vella (1992) in the context of 
testing for selectivity bias. 


19.7.1 Exogenous Explanatory Variables 


We now consider the case where the selection equation is of the censored Tobit form. 
The population model is 


$$y_1 = x_1\beta_1 + u_1 \tag{19.63}$$
$$y_2 = \max(0, x\delta_2 + v_2), \tag{19.64}$$


where $(x, y_2)$ is always observed in the population but $y_1$ is observed only when
$y_2 > 0$. A standard example occurs when $y_1$ is the log of the hourly wage offer and $y_2$
is weekly or annual hours of labor supply.


ASSUMPTION 19.3: (a) $(x, y_2)$ is always observed in the population, but $y_1$ is observed
only when $y_2 > 0$; (b) $(u_1, v_2)$ is independent of $x$; (c) $v_2 \sim \text{Normal}(0, \tau_2^2)$; and (d)
$E(u_1 \mid v_2) = \gamma_1 v_2$.


These assumptions are very similar to the assumptions for a probit selection equation.
The only difference is that $v_2$ now has an unknown variance, since $y_2$ is a
censored as opposed to binary variable.

Amemiya (1985) calls equations (19.63) and (19.64) the Type III Tobit model, but
we emphasize that equation (19.63) is the structural population equation of interest
and that equation (19.64) simply determines when $y_1$ is observed. In the labor
economics example, we are interested in the wage offer equation, and equation (19.64) is
a reduced-form hours equation. As with a probit selection equation, it makes no
sense to define $y_1$ to be, say, zero, just because we do not observe $y_1$.

The starting point is equation (19.52), just as in the probit selection case. Now
define the selection indicator as $s_2 = 1$ if $y_2 > 0$, and $s_2 = 0$ otherwise. Since $s_2$ is a
function of $x$ and $v_2$, it follows immediately that


$$E(y_1 \mid x, v_2, s_2) = x_1\beta_1 + \gamma_1 v_2. \tag{19.65}$$


This equation means that, if we could observe $v_2$, then an OLS regression of $y_1$ on $x_1$,
$v_2$ using the selected subsample would consistently estimate $(\beta_1, \gamma_1)$, as we discussed
in Section 19.4.1. While $v_2$ cannot be observed when $y_2 = 0$ (because when $y_2 = 0$,
we only know that $v_2 \le -x\delta_2$), for $y_2 > 0$, $v_2 = y_2 - x\delta_2$. Thus, if we knew $\delta_2$, we
would know $v_2$ whenever $y_2 > 0$. It seems reasonable that, because $\delta_2$ can be
consistently estimated by Tobit on the whole sample, we can replace $v_2$ with consistent
estimates.


Procedure 19.3: (a) Estimate equation (19.64) by standard Tobit using all $N$
observations. For $y_{i2} > 0$ (say $i = 1, 2, \ldots, N_1$), define


$$\hat{v}_{i2} = y_{i2} - x_i\hat{\delta}_2. \tag{19.66}$$

(b) Using observations for which $y_{i2} > 0$, estimate $\beta_1$, $\gamma_1$ by the OLS regression

$$y_{i1} \text{ on } x_{i1},\ \hat{v}_{i2}, \qquad i = 1, 2, \ldots, N_1. \tag{19.67}$$


This regression produces consistent, $\sqrt{N}$-asymptotically normal estimators of $\beta_1$ and
$\gamma_1$ under assumption 19.3.


The statistic to test for selectivity bias is just the usual $t$ statistic on $\hat{v}_{i2}$ in regression
(19.67), which was suggested by Vella (1992).

It seems likely that there is an efficiency gain over procedure 19.1. If $v_2$ were known
and we could use regression (19.67) for the entire population, there would definitely
be an efficiency gain: the error variance is reduced by conditioning on $v_2$ along with
$x$, and there would be no heteroskedasticity in the population. See problem 4.5.

Unlike in the probit selection case, $x_1 = x$ causes no problems here: $v_2$ always has
separate variation from $x_1$ because of variation in $y_2$. We do not need to rely on the
nonlinearity of the inverse Mills ratio.
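
A minimal sketch of procedure 19.3 on simulated data may help fix ideas. Everything here is illustrative: the helper names (`tobit_mle`, `procedure_19_3`), the data-generating design, and the use of a generic numerical optimizer for the Tobit step are assumptions, and the second-step standard errors (which need a two-step adjustment when $\gamma_1 \neq 0$) are not computed.

```python
import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm

def tobit_mle(y, X):
    """Censored-at-zero Tobit MLE; returns (delta_hat, tau_hat)."""
    pos = y > 0
    n = len(y)

    def neg_avg_loglik(theta):
        delta, log_tau = theta[:-1], theta[-1]
        tau = np.exp(log_tau)                      # parameterization keeps tau > 0
        xb = X @ delta
        ll_pos = norm.logpdf((y[pos] - xb[pos]) / tau) - log_tau
        ll_zero = norm.logcdf(-xb[~pos] / tau)
        return -(ll_pos.sum() + ll_zero.sum()) / n

    start = np.append(np.linalg.lstsq(X, y, rcond=None)[0], np.log(y.std()))
    res = minimize(neg_avg_loglik, start, method="BFGS")
    return res.x[:-1], np.exp(res.x[-1])

def procedure_19_3(y1, y2, X1, X):
    """(a) Tobit of y2 on X, all observations; (b) OLS of y1 on (X1, v2_hat), y2 > 0."""
    delta_hat, _ = tobit_mle(y2, X)
    sel = y2 > 0
    v2_hat = y2[sel] - X[sel] @ delta_hat          # Tobit residuals, equation (19.66)
    W = np.column_stack([X1[sel], v2_hat])
    return np.linalg.lstsq(W, y1[sel], rcond=None)[0]   # (beta_1_hat, gamma_1_hat)

# Simulated data (hypothetical design): gamma_1 = 0.5, so selection is not ignorable.
rng = np.random.default_rng(0)
N = 20_000
X = np.column_stack([np.ones(N), rng.normal(size=N)])
v2 = rng.normal(scale=2.0, size=N)
u1 = 0.5 * v2 + rng.normal(size=N)
y2 = np.maximum(0.0, X @ np.array([0.5, 1.0]) + v2)
y1 = X @ np.array([1.0, 0.5]) + u1                 # treated as unobserved when y2 == 0
coef = procedure_19_3(y1, y2, X, X)                # x_1 = x is allowed here
sel = y2 > 0
naive = np.linalg.lstsq(X[sel], y1[sel], rcond=None)[0]
print("control function:", coef.round(3), " naive OLS:", naive.round(3))
```

The simulation deliberately sets $x_1 = x$, with no exclusion restriction, to illustrate the point made above that the Tobit residual has its own variation through $y_2$.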


Example 19.8 (Wage Offer Equation for Married Women): We now apply procedure
19.3 to the wage offer equation for married women in example 19.6. (We assume
education is exogenous.) The only difference is that the first-step estimation is Tobit,
rather than probit, and we include the Tobit residuals as the additional explanatory
variables, not the inverse Mills ratio. In regression (19.67), the coefficient on $\hat{v}_2$ is
-.000053 (se = .000041), which is somewhat more evidence of a sample selection
problem, but we still do not reject the null hypothesis $H_0\colon \gamma_1 = 0$ at even the 15 percent
level against a two-sided alternative. Further, the coefficient on educ is .103
(se = .015), which is not much different from the OLS and Heckit estimates. (Again,
we use the usual OLS standard error.) When we include all exogenous variables in
the wage offer equation, the estimates from procedure 19.3 are much more stable
than the Heckit estimates. For example, the coefficient on educ becomes .093
(se = .016), which is comparable to the OLS estimates discussed in example 19.6.

For partial maximum likelihood estimation, we assume that $(u_1, v_2)$ is jointly
normal, and we use the density $f(y_2 \mid x)$ for the entire sample and the conditional
density $f(y_1 \mid x, y_2, s_2 = 1) = f(y_1 \mid x, y_2)$ for the selected sample. This
approach is fairly straightforward because, when $y_2 > 0$,
$y_1 \mid x, y_2 \sim \text{Normal}[x_1\beta_1 + \gamma_1(y_2 - x\delta_2), \eta_1^2]$, where
$\eta_1^2 = \sigma_1^2 - \sigma_{12}^2/\tau_2^2$, $\sigma_1^2 = \text{Var}(u_1)$, and $\sigma_{12} = \text{Cov}(u_1, v_2)$. The
log likelihood for observation $i$ is


$$\ell_i(\theta) = s_{i2}\log f(y_{i1} \mid x_i, y_{i2}; \theta) + \log f(y_{i2} \mid x_i; \delta_2, \tau_2^2), \tag{19.68}$$


where $f(y_{i1} \mid x_i, y_{i2}; \theta)$ is the $\text{Normal}[x_{i1}\beta_1 + \gamma_1(y_{i2} - x_i\delta_2), \eta_1^2]$ distribution,
evaluated at $y_{i1}$, and $f(y_{i2} \mid x_i; \delta_2, \tau_2^2)$ is the standard censored Tobit density (see
equation (17.19)). As shown in problem 13.7, the usual MLE theory can be used even
though the log-likelihood function is not based on a full conditional density.

It is possible to obtain sample selection corrections and tests for various other
nonlinear models when the selection rule is of the Tobit form. For example, suppose
that the binary variable $y_1$ given $z$ follows a probit model, but it is observed only
when $y_2 > 0$. A valid test for selection bias is to include the Tobit residuals, $\hat{v}_2$, in a
probit of $y_1$ on $z$, $\hat{v}_2$ using the selected sample; see Vella (1992). This procedure also
produces consistent estimates (up to scale), as can be seen by applying the maximum
likelihood results in Section 19.4.2 along with two-step estimation results.

Honoré, Kyriazidou, and Udry (1997) show how to estimate the parameters of the
Type III Tobit model without making distributional assumptions.


19.7.2 Endogenous Explanatory Variables 


We explicitly consider the case of a single endogenous explanatory variable, as in
Section 19.6.2. We use equations (19.56) and (19.57), and, in place of equation
(19.58), we have a Tobit selection equation:


$$y_3 = \max(0, z\delta_3 + v_3). \tag{19.69}$$


ASSUMPTION 19.4: (a) $(z, y_3)$ is always observed, $(y_1, y_2)$ is observed when $y_3 > 0$;
(b) $(u_1, v_3)$ is independent of $z$; (c) $v_3 \sim \text{Normal}(0, \tau_3^2)$; (d) $E(u_1 \mid v_3) = \gamma_1 v_3$; and (e)
$E(z'v_2) = 0$ and, writing $z\delta_2 = z_1\delta_{21} + z_2\delta_{22}$, $\delta_{22} \neq 0$.


Again, these assumptions are very similar to those used with a probit selection
mechanism.

To derive an estimating equation, write


$$y_1 = z_1\delta_1 + \alpha_1 y_2 + \gamma_1 v_3 + e_1, \tag{19.70}$$

where $e_1 \equiv u_1 - E(u_1 \mid v_3)$. Since $(e_1, v_3)$ is independent of $z$ by assumption 19.4b,
$E(e_1 \mid z, v_3) = 0$. From Theorem 19.1, if $v_3$ were observed, we could estimate equation
(19.70) by 2SLS on the selected sample using instruments $(z, v_3)$. As before, we can
estimate $v_3$ when $y_3 > 0$, since $\delta_3$ can be consistently estimated by Tobit of $y_3$ on $z$
(using the entire sample).


Procedure 19.4: (a) Obtain $\hat{\delta}_3$ from Tobit of $y_3$ on $z$ using all observations. Obtain
the Tobit residuals $\hat{v}_{i3} = y_{i3} - z_i\hat{\delta}_3$ for $y_{i3} > 0$.
(b) Using the selected subsample, estimate the equation


$$y_{i1} = z_{i1}\delta_1 + \alpha_1 y_{i2} + \gamma_1\hat{v}_{i3} + \text{error}_i \tag{19.71}$$


by 2SLS, using instruments $(z_i, \hat{v}_{i3})$. The estimators are $\sqrt{N}$-consistent and
asymptotically normal under assumption 19.4.


Comments similar to those after procedure 19.2 hold here as well. Strictly speaking,
identification really requires that $z_2$ appear in the linear projection of $y_2$ onto $z_1$,
$z_2$, and $v_3$ in the selected subpopulation. The null of no selection bias is tested using
the 2SLS $t$ statistic (or maybe its heteroskedasticity-robust version) on $\hat{v}_{i3}$. When
$\gamma_1 \neq 0$, standard errors should be corrected using two-step methods.

As in the case with a probit selection equation, the endogenous variable $y_2$ can be
continuous, discrete, a corner solution, and so on. Extending the method to multiple
endogenous explanatory variables is straightforward. The only restriction is the usual
one for linear models: we need enough instruments to identify the structural equation.
See problem 19.9 for an application to the Mroz data.

An interesting special case of model (19.56), (19.57), and (19.69) occurs when
$y_2 = y_3$. Actually, because we only use observations for which $y_3 > 0$, $y_2 = y_3^*$ is
also allowed, where $y_3^* = z\delta_3 + v_3$. Either way, the variable that determines selection
also appears in the structural equation. This special case could be useful when sample
selection is caused by a corner solution outcome on $y_3$ (in which case $y_2 = y_3$ is
natural) or because $y_3^*$ is subject to data censoring (in which case $y_2 = y_3^*$ is more
realistic). An example of the former occurs when $y_3$ is hours worked and we assume
hours appears in the wage offer function. As a data-censoring example, suppose that
$y_1$ is a measure of growth in an infant's weight starting from birth and that we observe
$y_1$ only if the infant is brought into a clinic within three months. Naturally, birth
weight depends on age, and so $y_3^*$ (length of time between the first and second
measurements, which has quantitative meaning) appears as an explanatory variable in
the equation for $y_1$. We have a data-censoring problem for $y_3^*$, which causes a sample
selection problem for $y_1$. In this case, we would estimate a censored normal regression
model for $y_3$ [or, possibly, $\log(y_3)$] to account for the data censoring. We would
include the residuals $\hat{v}_{i3} = y_{i3} - z_i\hat{\delta}_3$ in equation (19.71) for the noncensored
observations. As our extra instrument we might use distance from the child's home to the
clinic.


19.7.3 Estimating Structural Tobit Equations with Sample Selection


We briefly show how a structural Tobit model can be estimated using the methods of
the previous section. As an example, consider the structural labor supply model


$$\log(w^o) = z_1\beta_1 + u_1 \tag{19.72}$$
$$h = \max[0, z_2\beta_2 + \alpha_2\log(w^o) + u_2]. \tag{19.73}$$


This system involves simultaneity and sample selection because we observe $w^o$ only if
$h > 0$.

The general form of the model is


$$y_1 = z_1\beta_1 + u_1 \tag{19.74}$$
$$y_2 = \max(0, z_2\beta_2 + \alpha_2 y_1 + u_2). \tag{19.75}$$


ASSUMPTION 19.5: (a) $(z, y_2)$ is always observed; $y_1$ is observed when $y_2 > 0$; (b)
$(u_1, u_2)$ is independent of $z$ with a zero-mean bivariate normal distribution; and (c) $z_1$
contains at least one element whose coefficient is different from zero that is not in $z_2$.


As always, it is important to see that equations (19.74) and (19.75) constitute a
model describing a population. If $y_1$ were always observed, then equation (19.74)
could be estimated by OLS. If, in addition, $u_1$ and $u_2$ were uncorrelated, equation
(19.75) could be estimated by type I Tobit. Correlation between $u_1$ and $u_2$ could be
handled by the methods of Section 17.5.2. Now, we require new methods, whether or
not $u_1$ and $u_2$ are uncorrelated, because $y_1$ is not observed when $y_2 = 0$.

The restriction in assumption 19.5c is needed to identify the structural parameters
$(\beta_2, \alpha_2)$ ($\beta_1$ is always identified). To see that this condition is needed, and for finding
the reduced form for $y_2$, it is useful to introduce the latent variable


$$y_2^* = z_2\beta_2 + \alpha_2 y_1 + u_2 \tag{19.76}$$


so that $y_2 = \max(0, y_2^*)$. If equations (19.74) and (19.76) make up the system of
interest (that is, if $y_2^*$ and $y_1$ are always observed), then $\beta_1$ is identified without
further restrictions, but identification of $\alpha_2$ and $\beta_2$ requires exactly assumption 19.5c.
This turns out to be sufficient even when $y_2$ follows a Tobit model and we have
nonrandom sample selection.

The reduced form for $y_2^*$ is $y_2^* = z\delta_2 + v_2$. Therefore, we can write the reduced
form of equation (19.75) as


$$y_2 = \max(0, z\delta_2 + v_2). \tag{19.77}$$


But then equations (19.74) and (19.77) constitute the model we studied in Section
19.7.1. The vector $\delta_2$ is consistently estimated by Tobit, and $\beta_1$ is estimated as in
procedure 19.3. The only remaining issue is how to estimate the structural parameters
of equation (19.75), $\alpha_2$ and $\beta_2$. In the labor supply case, these are the labor supply
parameters.

Assuming identification, estimation of $(\alpha_2, \beta_2)$ is fairly straightforward after having
estimated $\beta_1$. To see this point, write the reduced form of $y_2$ in terms of the structural
parameters as


$$y_2 = \max[0, z_2\beta_2 + \alpha_2(z_1\beta_1) + v_2]. \tag{19.78}$$
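
The substitution behind equation (19.78) can be written out in one step. The display below simply plugs equation (19.74) into the latent equation (19.76); the only thing added to the text is the explicit expression for the composite reduced-form error, which follows from the algebra.

```latex
\begin{aligned}
y_2^* &= z_2\beta_2 + \alpha_2 y_1 + u_2 \\
      &= z_2\beta_2 + \alpha_2 (z_1\beta_1 + u_1) + u_2 \\
      &= z_2\beta_2 + \alpha_2 (z_1\beta_1) + v_2,
      \qquad v_2 \equiv \alpha_2 u_1 + u_2 .
\end{aligned}
```

Because $v_2$ is a linear combination of $(u_1, u_2)$, it inherits independence from $z$ under assumption 19.5b, and applying $\max(0, \cdot)$ to the last line gives equation (19.78).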


Under joint normality of $u_1$ and $u_2$, $v_2$ is normally distributed. Therefore, if $\beta_1$ were
known, $\beta_2$ and $\alpha_2$ could be estimated by standard Tobit using $z_2$ and $z_1\beta_1$ as regressors.
Operationalizing this procedure requires replacing $\beta_1$ with its consistent estimator.
Thus, using all observations, $\beta_2$ and $\alpha_2$ are estimated from the Tobit equation


$$y_{i2} = \max[0, z_{i2}\beta_2 + \alpha_2(z_{i1}\hat{\beta}_1) + \text{error}_i]. \tag{19.79}$$

To summarize, we have the following:


Procedure 19.5: (a) Use procedure 19.3 to obtain $\hat{\beta}_1$.
(b) Obtain $\hat{\beta}_2$ and $\hat{\alpha}_2$ from the Tobit in equation (19.79).


In applying this procedure, it is important to note that the explanatory variable in
equation (19.79) is $z_{i1}\hat{\beta}_1$ for all $i$. These are not the fitted values from regression
(19.67), which depend on $\hat{v}_{i2}$. Also, it may be tempting to use $y_{i1}$ in place of $z_{i1}\hat{\beta}_1$ for
that part of the sample for which $y_{i1}$ is observed. This approach is not a good idea:
the estimators are inconsistent in this case.

The estimation in equation (19.79) makes it clear that the procedure fails if $z_1$ does
not contain at least one variable not in $z_2$. If $z_1$ is a subset of $z_2$, then $z_{i1}\hat{\beta}_1$ is a linear
combination of $z_{i2}$, and so perfect multicollinearity will exist in equation (19.79).

Estimating $\text{Avar}(\hat{\alpha}_2, \hat{\beta}_2)$ is even messier than estimating $\text{Avar}(\hat{\beta}_1)$, since $(\hat{\alpha}_2, \hat{\beta}_2)$
comes from a three-step procedure. Often just the usual Tobit standard errors and
test statistics reported from equation (19.79) are used, even though these are not
strictly valid. By setting the problem up as a large GMM problem, as illustrated in
Chapter 14, correct standard errors and test statistics can be obtained. Bootstrapping
can be used, too.

Under assumption 19.5, a full maximum likelihood approach is possible. In fact,
the log-likelihood function can be constructed from equations (19.74) and (19.78),
and it has a form very similar to equation (19.68). The only difference is that nonlinear
restrictions are imposed automatically on the structural parameters. In addition
to making it easy to obtain valid standard errors, MLE is desirable because it
allows us to estimate $\sigma_2^2 = \text{Var}(u_2)$, which is needed to estimate average partial effects
in equation (19.75).

In examples such as labor supply, it is not clear where the elements of $z_1$ that are
not in $z_2$ might come from. One possibility is a union binary variable, if we believe
that union membership increases wages (other factors accounted for) but has no effect
on labor supply once wage and other factors have been controlled for. This approach
would require knowing union status for people whether or not they are working in
the period covered by the survey. In some studies past experience is assumed to affect
wage (which it certainly does) and is assumed not to appear in the labor supply
function, a tenuous assumption.


19.8 Inverse Probability Weighting for Missing Data 


We now turn to a different method for correcting for general missing data problems, 
inverse probability weighting (IPW). Compared with Heckman-type approaches, 
IPW applies very generally to any estimation problem that involves minimization 
or maximization. However, the assumptions under which IPW produces consistent 
estimators of the population parameters are quite different from those used in 
Heckman-type methods. We highlight the differences in this section, which follows 
the M-estimation setup in Chapter 12 and in Wooldridge (2007). 

Again, we characterize missing data using a binary selection indicator, $s_i$. Therefore,
a random draw from the population consists of $(w_i, s_i)$, and all or part of $w_i$ is
not observed if $s_i = 0$. We are interested in estimating $\theta_o$, the solution to the
population problem

$$\min_{\theta \in \Theta} E[q(w_i, \theta)], \tag{19.80}$$

where $q(w, \cdot)$ is the objective function for given $w$. If we use the selected sample to
estimate $\theta_o$, we solve


$$\min_{\theta \in \Theta} N^{-1}\sum_{i=1}^{N} s_i q(w_i, \theta). \tag{19.81}$$

We call the solution to this problem the unweighted M-estimator, $\hat{\theta}_u$, to distinguish
it from the weighted estimator that will be introduced later. We have already seen
examples, particularly for the linear model, where $\hat{\theta}_u$ is not consistent for $\theta_o$.

Inverse probability weighting is a general approach to nonrandom sampling that 
dates back to Horvitz and Thompson (1952). IPW has been used more recently for 
regression models with missing data (for example, Robins, Rotnitzky, and Zhou 
1995) and in the treatment effects literature (Hirano, Imbens, and Ridder (2003) and 
Chapter 21). The key is that we have some variables that are “good” predictors of 
selection, something we make precise in the following assumption. 


ASSUMPTION 19.6: (a) The vector $w_i$ is observed whenever $s_i = 1$.
(b) There is a random vector $z_i$ such that $P(s_i = 1 \mid w_i, z_i) = P(s_i = 1 \mid z_i) \equiv p(z_i)$.
(c) For all $z \in \mathcal{Z} \subset \mathbb{R}^J$, $p(z) > 0$.
(d) $z_i$ is observed when $s_i = 1$.


When $z_i$ is always observed, assumption 19.6 is essentially the missing at random
assumption mentioned in Section 19.4. Wooldridge (2007) shows that allowing $z_i$ to
be unobserved when $s_i = 0$ allows coverage of certain stratified sampling schemes,
which we treat in Chapter 20, and IPW solutions to censored duration data, which
we cover in Chapter 22. One also sees assumption 19.6b described as ignorable
selection (conditional on $z_i$). In the treatment effects literature (see Chapter 21),
assumption 19.6b is essentially the unconfoundedness assumption.

Assumption 19.6 encompasses the selection on observables assumption mentioned
in Section 19.4. In the general M-estimation framework, selection on observables
typically applies when $w_i$ partitions as $(x_i, y_i)$, $x_i$ is always observed but $y_i$ is not, and
$z_i$ is a vector that is always observed and includes $x_i$. Then, $s_i$ is allowed to be a
function of observables $z_i$, but $s_i$ cannot be related to unobserved factors affecting
$y_i$. Assumption 19.6 does not apply to the selection on unobservables case, at least
as that terminology has been used in econometrics. Selection on unobservables is
essentially what we covered in Sections 19.5 and 19.6, where we explicitly allowed
selection to be correlated with unobservables after conditioning on exogenous variables.
Heckman's (1976) solution to the incidental truncation problem, which applies
to linear models and a few others, requires at least one exogenous variable that
affects selection but does not have a partial effect on the outcome. In assumption
19.6, the $z_i$ should have properties different from exogenous variables that are used
for identification in the Heckman approach. In effect, the $z_i$ should be good proxies
for the unobservables that affect $y_i$ and also determine selection. Sometimes, $z_i$
includes outcomes on $y_i$ and even $x_i$ from previous time periods.

As an example, consider the linear model for a random draw, $y_i = x_i\beta_o + u_i$,
$E(x_i'u_i) = 0$, and suppose data are missing on $y_i$ when $s_i = 0$. Let $z_i$ be a vector of
variables that includes $x_i$, and, for concreteness, assume that $s_i = 1[z_i\delta_o + v_i > 0]$,
where $v_i$ is independent of $z_i$. To use a Heckman selection correction we would
assume $E(z_i'u_i) = 0$ but allow arbitrary correlation between $v_i$ and $u_i$, so that selection is
correlated with $u_i$ even after netting out $z_i$. By contrast, assumption 19.6b is the same
as $P(s_i = 1 \mid z_i, u_i) = P(s_i = 1 \mid z_i)$, which essentially means that $u_i$ and $v_i$ should be
independent. Thus, under assumption 19.6b, we assume that $z_i$ is such a good
predictor of selection that, conditional on $z_i$, $s_i$ is independent of the unobservables in
the regression equation, $u_i$. However, $z_i$ and $u_i$ can be arbitrarily correlated.

If we could observe the selection probabilities then, under assumption 19.6, solving
the missing data problem would be easy. To see why, let $g(w)$ be any scalar function
such that the mean, $\mu \equiv E[g(w_i)]$, exists. Then, using iterated expectations,


$$\begin{aligned}
E[s_i g(w_i)/p(z_i)] &= E\{E[s_i g(w_i)/p(z_i) \mid w_i, z_i]\} \\
&= E\{E(s_i \mid w_i, z_i)\, g(w_i)/p(z_i)\} \\
&= E\{P(s_i = 1 \mid w_i, z_i)\, g(w_i)/p(z_i)\} \\
&= E\{p(z_i)\, g(w_i)/p(z_i)\} = E[g(w_i)],
\end{aligned} \tag{19.82}$$


where the last equality follows from $P(s_i = 1 \mid w_i, z_i) = p(z_i)$. This result shows that
the population mean of any function of $w_i$ can be recovered by weighting the selected
observation by the inverse of the probability of selection. It follows immediately that
a consistent (actually, unbiased) estimator of $\mu$ is $N^{-1}\sum_{i=1}^{N}[s_i g(w_i)/p(z_i)]$. Actually, a
somewhat more common estimator, based on the fact that $E[s_i/p(z_i)] = 1$, is


$$\hat{\mu}_{IPW} = \left(\sum_{h=1}^{N}[s_h/p(z_h)]\right)^{-1}\left(\sum_{i=1}^{N}[s_i g(w_i)/p(z_i)]\right), \tag{19.83}$$


which is a weighted average of the sampled data where the weights add to one. While
equation (19.83) appears to depend on $N$ (the number of times the population was
sampled), $N$ is not needed to compute $\hat{\mu}_{IPW}$. The sampling weights implicit in
equation (19.83) are often reported in survey data to obtain means in the presence of
missing data.
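
As a small numerical illustration of equations (19.82) and (19.83), the sketch below compares the unweighted mean on the selected sample with the two IPW estimators when the selection probabilities are known. The function name `ipw_means` and the simulated design are hypothetical; selection depends only on $z$, so assumption 19.6b holds by construction.

```python
import numpy as np

def ipw_means(g, s, p):
    """Return (unnormalized, normalized) IPW estimates of E[g(w)].

    g: values g(w_i); entries with s_i = 0 are never used
    s: 0/1 selection indicators
    p: selection probabilities p(z_i), assumed known and strictly positive
    """
    g = np.asarray(g, dtype=float)
    s = np.asarray(s, dtype=float)
    p = np.asarray(p, dtype=float)
    w = s / p                                   # weights are zero when s_i = 0
    gs = np.where(s == 1, g, 0.0)               # guards against NaN placeholders
    unnormalized = np.sum(w * gs) / len(g)      # N^{-1} sum of s_i g_i / p(z_i)
    normalized = np.sum(w * gs) / np.sum(w)     # equation (19.83); N is not needed
    return unnormalized, normalized

# Simulated data (hypothetical design): the population mean of y is 1.
rng = np.random.default_rng(0)
N = 200_000
z = rng.normal(size=N)
y = 1.0 + 2.0 * z + rng.normal(size=N)
p = 1.0 / (1.0 + np.exp(-(0.5 + z)))            # P(s = 1 | y, z) = p(z)
s = (rng.uniform(size=N) < p).astype(float)

ht, norm_ipw = ipw_means(y, s, p)
naive = y[s == 1].mean()                        # unweighted mean, selected sample
print(f"naive: {naive:.3f}  IPW: {ht:.3f}  normalized IPW: {norm_ipw:.3f}")
```

The normalized version divides by the sum of the weights rather than by $N$, which is why it can be computed from the selected observations and their weights alone.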

We can now see how to use IPW estimation in the context of M-estimation. The
IPW estimator, $\hat{\theta}_w$, solves


$$\min_{\theta \in \Theta} N^{-1}\sum_{i=1}^{N}[s_i/p(z_i)]\, q(w_i, \theta). \tag{19.84}$$

From the previous argument, the mean of each summand is $E[q(w_i, \theta)]$, which is
minimized at $\theta_o$, and so, under the mild conditions of the uniform weak law of large
numbers, $\hat{\theta}_w$ is consistent for $\theta_o$; see Theorem 12.2.

In most cases where the selection probabilities $p(z_i)$ are not known, $z_i$ is assumed
to be always observed, so that a model for $P(s_i = 1 \mid z_i)$ can be estimated by binary
response maximum likelihood. Here we consider the case where we use a binary
response model for $P(s_i = 1 \mid z_i)$, which requires that $z_i$ is always observed.


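ASSUMPTION 19.7: (a) $G(z, \gamma)$ is a parametric model for $p(z)$, where $\gamma \in \Gamma \subset \mathbb{R}^M$ and
$G(z, \gamma) > 0$, all $z$ and $\gamma$.

(b) There exists $\gamma_o \in \Gamma$ such that $p(z) = G(z, \gamma_o)$.

(c) $\hat{\gamma}$ is the binary response maximum likelihood estimator, and the regularity
conditions for MLE hold such that


$$\sqrt{N}(\hat{\gamma} - \gamma_o) = \{E[d_i(\gamma_o)d_i(\gamma_o)']\}^{-1}\left(N^{-1/2}\sum_{i=1}^{N} d_i(\gamma_o)\right) + o_p(1), \tag{19.85}$$


where $d_i(\gamma) \equiv \nabla_\gamma G(z_i, \gamma)'[s_i - G(z_i, \gamma)]/\{G(z_i, \gamma)[1 - G(z_i, \gamma)]\}$ is the $M \times 1$ score vector
for the MLE.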


Given $\hat{\gamma}$, we can form $G(z_i, \hat{\gamma})$ for all $i$ with $s_i = 1$, and then obtain the weighted
M-estimator, $\hat{\theta}_w$, by solving


$$\min_{\theta \in \Theta} N^{-1}\sum_{i=1}^{N}[s_i/G(z_i, \hat{\gamma})]\, q(w_i, \theta). \tag{19.86}$$


Replacing the unknown probability $p(z_i) = G(z_i, \gamma_o)$ with $G(z_i, \hat{\gamma})$ does not affect
consistency of $\hat{\theta}_w$ under the general conditions for two-step estimators; see Section
12.4.1. More interesting is finding the asymptotic distribution of $\sqrt{N}(\hat{\theta}_w - \theta_o)$.

The following result assumes that the objective function $q(w, \cdot)$ is twice continuously
differentiable on the interior of $\Theta$, as in Section 13.10.2. Write
$r(w_i, \theta) \equiv \nabla_\theta q(w_i, \theta)'$ as the $P \times 1$ score of the unweighted objective function,
$H(w, \theta) \equiv \nabla_\theta^2 q(w, \theta)$ as the $P \times P$ Hessian of $q(w, \theta)$, and
$k(s_i, z_i, w_i, \gamma, \theta) \equiv [s_i/G(z_i, \gamma)]\, r(w_i, \theta)$ as the selected, weighted score function; in
particular, $k(s_i, z_i, w_i, \gamma, \theta)$ is zero whenever $s_i = 0$.

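It is easily shown that the conditions of the "surprising" efficiency result in Section
13.10.2 hold. Therefore,


$$\sqrt{N}(\hat{\theta}_w - \theta_o) \overset{a}{\sim} \text{Normal}(0, A_o^{-1}D_oA_o^{-1}), \tag{19.87}$$

where $A_o \equiv E[H(w_i, \theta_o)]$, $D_o \equiv E(e_ie_i')$, $e_i \equiv k_i - E(k_id_i')[E(d_id_i')]^{-1}d_i$, and $k_i$ and $d_i$
are evaluated at $(\gamma_o, \theta_o)$ and $\gamma_o$, respectively. Further, consistent estimators of $A_o$ and
$D_o$, respectively, are


$$\hat{A} \equiv N^{-1}\sum_{i=1}^{N}[s_i/G(z_i, \hat{\gamma})]\, H(w_i, \hat{\theta}_w) \tag{19.88}$$

and

$$\hat{D} \equiv N^{-1}\sum_{i=1}^{N}\hat{e}_i\hat{e}_i', \tag{19.89}$$


where the $\hat{e}_i \equiv \hat{k}_i - \left(N^{-1}\sum_{i=1}^{N}\hat{k}_i\hat{d}_i'\right)\left(N^{-1}\sum_{i=1}^{N}\hat{d}_i\hat{d}_i'\right)^{-1}\hat{d}_i$ are the $P \times 1$ residuals from
the multivariate regression of $\hat{k}_i$ on $\hat{d}_i$, $i = 1, \ldots, N$, and all hatted quantities are
evaluated at $\hat{\gamma}$ or $\hat{\theta}_w$. The asymptotic variance of $\hat{\theta}_w$ is consistently estimated as
$\hat{A}^{-1}\hat{D}\hat{A}^{-1}/N$.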


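We can compare expression (19.87) with the asymptotic variance that we would
obtain by using a known value of $\gamma_o$ in place of the conditional MLE, $\hat{\gamma}$. As before, let
$\tilde{\theta}_w$ denote the estimator that uses $1/G(z_i, \gamma_o)$ as the weights. Then


$$\sqrt{N}(\tilde{\theta}_w - \theta_o) \overset{a}{\sim} \text{Normal}(0, A_o^{-1}B_oA_o^{-1}), \tag{19.90}$$


where $B_o \equiv E(k_ik_i')$. Because $B_o - D_o$ is positive semidefinite,
$\text{Avar}\,\sqrt{N}(\tilde{\theta}_w - \theta_o) - \text{Avar}\,\sqrt{N}(\hat{\theta}_w - \theta_o)$ is positive semidefinite.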
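
The positive semidefiniteness of $B_o - D_o$ follows from standard linear projection algebra. The short derivation below is a sketch that uses only the definitions of $e_i$, $D_o$, and $B_o$ given above, together with invertibility of $E(d_id_i')$.

```latex
\begin{aligned}
D_o &= E(e_i e_i')
     = E(k_i k_i') - E(k_i d_i')\,[E(d_i d_i')]^{-1} E(d_i k_i'), \\
B_o - D_o &= E(k_i d_i')\,[E(d_i d_i')]^{-1} E(d_i k_i') \;\succeq\; 0 .
\end{aligned}
```

The first line uses the fact that $e_i$ is the population residual from the linear projection of $k_i$ on $d_i$; the second is a quadratic form in the positive definite matrix $[E(d_id_i')]^{-1}$. Pre- and postmultiplying by $A_o^{-1}$ preserves positive semidefiniteness, which gives the ranking of the two asymptotic variances.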

As an example, consider the linear regression model $y = x\beta_o + u$, $E(x'u) = 0$, and
suppose the estimated probabilities $\hat{p}_i = G(z_i, \hat{\gamma})$ are from a logit estimation. The
gradient for the logit estimation is $\hat{d}_i' = z_i[s_i - \Lambda(z_i\hat{\gamma})]$, a $1 \times M$ vector (the dimension
of $z_i$, which includes a constant). The selected, weighted gradient for the linear
regression problem is $\hat{k}_i' = s_ix_i\hat{u}_i/\hat{p}_i$, where $\hat{u}_i = y_i - x_i\hat{\beta}_w$ are the residuals after the
IPW estimation. The adjustment to the asymptotic variance is obtained by getting
the (row) residuals $\hat{e}_i$ from the regression $s_ix_i\hat{u}_i/\hat{p}_i$ on $z_i[s_i - \Lambda(z_i\hat{\gamma})]$ using all
observations. Then, $\hat{D}$ is constructed as in expression (19.89), and, in this case,
$\hat{A} = N^{-1}\sum_{i=1}^{N}(s_i/\hat{p}_i)x_i'x_i$. The asymptotic variance of $\hat{\beta}_w$ is estimated as


$$\left[\sum_{i=1}^{N}(s_i/\hat{p}_i)x_i'x_i\right]^{-1}\left(\sum_{i=1}^{N}\hat{e}_i'\hat{e}_i\right)\left[\sum_{i=1}^{N}(s_i/\hat{p}_i)x_i'x_i\right]^{-1}. \tag{19.91}$$

The conservative estimate would replace $\hat{e}_i$ with $s_ix_i\hat{u}_i/\hat{p}_i$, in which case the estimator
looks just like a "heteroskedasticity"-robust sandwich estimator in the context of
weighted least squares.

As with all general efficiency results, the conclusion that a difference in asymptotic
variances is positive semidefinite does not guarantee an efficiency gain in particular
cases: the statement includes the possibility that there is no difference in the
asymptotic variances.
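
The linear regression example with logit weights can be sketched as follows. This is only an illustration under stated assumptions: the function names (`logit_mle`, `ipw_ols`), the Newton iterations for the logit, and the simulated design are all hypothetical, and the code computes both the adjusted variance estimate in (19.91) and the conservative version that ignores estimation of the weights.

```python
import numpy as np

def logit_mle(Z, s, iters=50):
    """Logit MLE by Newton's method; returns coefficients and fitted probabilities."""
    g = np.zeros(Z.shape[1])
    for _ in range(iters):
        p = 1.0 / (1.0 + np.exp(-Z @ g))
        step = np.linalg.solve((Z * (p * (1 - p))[:, None]).T @ Z, Z.T @ (s - p))
        g += step
        if np.max(np.abs(step)) < 1e-10:
            break
    return g, 1.0 / (1.0 + np.exp(-Z @ g))

def ipw_ols(y, X, Z, s):
    """IPW least squares with logit weights; returns (beta, V_adjusted, V_conservative)."""
    N = len(s)
    _, p_hat = logit_mle(Z, s)
    w = s / p_hat                                  # zero when s_i = 0
    y0 = np.where(s == 1, y, 0.0)                  # y is never used when s_i = 0
    XtWX = (X * w[:, None]).T @ X
    beta = np.linalg.solve(XtWX, X.T @ (w * y0))
    u_hat = np.where(s == 1, y0 - X @ beta, 0.0)
    K = X * (w * u_hat)[:, None]                   # rows: s_i x_i u_i / p_i
    D = Z * (s - p_hat)[:, None]                   # rows: logit scores z_i [s_i - p_i]
    E = K - D @ np.linalg.lstsq(D, K, rcond=None)[0]   # residuals, all N observations
    A_inv = np.linalg.inv(XtWX)
    return beta, A_inv @ (E.T @ E) @ A_inv, A_inv @ (K.T @ K) @ A_inv

# Simulated data (hypothetical design): r proxies for the part of u tied to selection.
rng = np.random.default_rng(0)
N = 100_000
x, r = rng.normal(size=N), rng.normal(size=N)
u = 0.8 * r + rng.normal(size=N)                   # E(x u) = 0, but u depends on r
y = 1.0 + 0.5 * x + u
X = np.column_stack([np.ones(N), x])
Z = np.column_stack([np.ones(N), x, r])            # z includes x and the proxy r
p_true = 1.0 / (1.0 + np.exp(-(0.3 + 0.5 * x - 1.0 * r)))
s = (rng.uniform(size=N) < p_true).astype(float)

beta, V_adj, V_cons = ipw_ols(y, X, Z, s)
beta_unw = np.linalg.lstsq(X[s == 1], y[s == 1], rcond=None)[0]
print("IPW:", beta.round(3), " unweighted:", beta_unw.round(3))
print("adjusted se:", np.sqrt(np.diag(V_adj)).round(4),
      " conservative se:", np.sqrt(np.diag(V_cons)).round(4))
```

In any given sample the adjusted standard errors cannot exceed the conservative ones, because the regression residuals have a smaller outer-product sum than the raw weighted scores; how large the difference is depends on the design.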

One case where there is no efficiency gain from using the estimated probabilities
occurs when the missing data mechanism is, in a particular sense, exogenous. Consider
the linear regression model. If we impose the assumption


$$E(u \mid x, z) = 0, \tag{19.92}$$


which now means that $E(y \mid x, z) = E(y \mid x) = x\beta_o$ and $P(s = 1 \mid x, y, z) = P(s = 1 \mid z)$,
then the asymptotic variance is given by expression (19.90) whether or not we
estimate the weights. Further, the probability weights can come from a misspecified
estimation problem without affecting consistency. For maximum likelihood problems,
the exogeneity condition is $D(y \mid x, z) = D(y \mid x)$. An important situation where
this condition and assumption (19.92) hold occurs when $z = x$ and the model for
$D(y \mid x)$ or $E(y \mid x)$ is correctly specified; the exogeneity condition on selection is
$P(s = 1 \mid x, y) = P(s = 1 \mid x)$.

Not surprisingly, in the general case of exogenous sampling, the unweighted esti- 
mator is also consistent. Along with assumption 19.6b, we generally define exogenous 
selection in the context of M-estimation as follows. Assume $\theta_o$ satisfies


$\theta_o = \operatorname*{argmin}_{\theta \in \Theta} E[q(w,\theta) \,|\, z]$ (19.93)


for all outcomes z. This requirement may seem abstract, but it holds for the usual 
estimation methods when the appropriate underlying feature of the conditional dis- 
tribution is correctly specified and z is exogenous in the model. For example, for 
nonlinear least squares and quasi-MLE in the linear exponential family, assumption 
(19.93) holds whenever E( y |x,z) = E(y|x) and the latter is correctly specified. This 
condition holds for conditional MLE, too. 

Wooldridge (2007, Theorem 4.3) shows that, under exogenous sampling, the 
unweighted estimator is more efficient than the weighted estimator under a version of 
the conditional information matrix equality. Generally, the condition is stated as 


$E[\nabla_\theta q(w,\theta_o)'\nabla_\theta q(w,\theta_o) \,|\, z] = \sigma_o^2 E[\nabla_\theta^2 q(w,\theta_o) \,|\, z]$, (19.94)


which holds under assumption (19.93) under suitable regularity conditions for cor- 
rectly specified CMLE, nonlinear least squares under homoskedasticity, and quasi- 
MLE under the assumption that the GLM variance assumption holds. The usual 
nonrobust variance matrix estimators on the selected sample are valid. See
Wooldridge (2007) for further discussion.

The previous results help to inform the decision of when weighting should and 
should not be used. If features of an unconditional distribution, say D(w), are of in- 
terest, unweighted estimators consistently estimate the parameters only if P(s = 1 | w) 
= P(s = 1); that is, the data are missing completely at random (Rubin, 1976). Of
course, consistency of the weighted estimator relies on the presence of z such that 
P(s = 1|w,z) = P(s = 1 |z), but maintaining such an assumption is often one’s only 
recourse. 

The decision to weight is more subtle when we begin with the premise that some 
feature of a conditional distribution, D(y|x), is of interest. Wooldridge (2007) 
describes several scenarios concerning both consistent estimation and efficient esti- 
mation. Here, we briefly discuss a problem that can arise with weighting if data are 
missing on some of the conditioning variables, x. The issue arises because, in general, 
if data are missing on elements of x—due to attrition, say, or nonresponse—these 
elements generally cannot be included in the selection predictors, z. But then suppose 
that selection is entirely a function of x: P(s = 1|x,y) = P(s = 1 |x) (even though 
elements of x are not observed). Then the unweighted estimator is consistent if 
(the feature of) D(y| x) is correctly specified. Generally, if z omits any element of x, 
the weighted estimator would actually be inconsistent. In effect, the wrong weights 
are used because the condition P(s=1|x,y,z) = P(s=1|z) cannot hold. See 
Wooldridge (2007) for additional discussion. 

When z can be chosen to include x, the case for weighting is much stronger: if 
selection does depend only on x, that fact will be picked up in large enough samples 
if the model for P(s = 1|z) is sufficiently flexible. Although this approach covers 
the case of treatment effects estimation—see Chapter 21—it does not cover general 
missing data problems. Therefore, one has to be cautious when using IPW when 
some conditioning variables are missing: important differences between the weighted 
and unweighted estimates cannot necessarily be attributed to a problem with the 
unweighted estimator, and we can never know why the two estimates are different 
(unless we have access to a random sample). 

When the selection probabilities reflect stratified sampling and are determined 
from the sample design, the case for weighting is also stronger; see Chapter 20. 


19.9 Sample Selection and Attrition in Linear Panel Data Models 


In our treatment of panel data models we have assumed that a balanced panel is 
available—each cross section unit has the same time periods available. Often, some 
time periods are missing for some units in the population of interest, and we are left
with an unbalanced panel. Unbalanced panels can arise for several reasons. First, the 
survey design may simply rotate people or firms out of the sample based on
prespecified rules. For example, if a survey of individuals begins at time t = 1, at time
t = 2 some of the original people may be dropped and new people added. At t = 3
some additional people might be dropped and others added; and so on. This is an 
example of a rotating panel. 

Provided the decision to rotate units out of a panel is made randomly, unbalanced 
panels are fairly easy to deal with, as we will see shortly. A more complicated prob- 
lem arises when attrition from a panel is due to units electing to drop out. If this de- 
cision is based on factors that are systematically related to the response variable, even 
after we condition on explanatory variables, a sample selection problem can result— 
just as in the cross section case. Nevertheless, a panel data set provides us with 
the means to handle, in a simple fashion, attrition that is based on a time-constant, 
unobserved effect, provided we use first-differencing methods; we show this in Section 
19.9.3. 

A different kind of sample selection problem occurs when people do not disappear 
from the panel but certain variables are unobserved for at least some time periods. 
This is the incidental truncation problem discussed in Section 19.6. A leading case is 
estimating a wage offer equation using a panel of individuals. Even if the population 
of interest is people who are employed in the initial year, some people will become 
unemployed in subsequent years. For those people we cannot observe a wage offer, 
just as in the cross-sectional case. This situation is different from the attrition prob- 
lem where people leave the sample entirely and, usually, do not reappear in later 
years. In the incidental truncation case we observe some variables on everyone in 
each time period. 


19.9.1 Fixed and Random Effects Estimation with Unbalanced Panels 


We begin by studying assumptions under which the usual fixed effects estimator on 
the unbalanced panel is consistent. The model is the usual linear, unobserved effects 
model under random sampling in the cross section: for any i, 


$y_{it} = x_{it}\beta + c_i + u_{it}, \quad t = 1, \ldots, T$ (19.95)


where $x_{it}$ is $1 \times K$ and $\beta$ is the $K \times 1$ vector of interest. As before, we assume that $N$
cross section observations are available and the asymptotic analysis is as $N \to \infty$. We
first cover the case where $c_i$ is allowed to be correlated with $x_{it}$, so that all elements of
$x_{it}$ are time varying.

We treated the case where all T time periods are available in Chapters 10 and 11.
Now we consider the case where some time periods might be missing for some of the
cross section draws. Think of t = 1 as the first time period for which data on anyone
in the population are available, and t = T as the last possible time period. For a
random draw i from the population, let $s_i = (s_{i1}, \ldots, s_{iT})'$ denote the $T \times 1$ vector of
selection indicators: $s_{it} = 1$ if $(x_{it}, y_{it})$ is observed, and zero otherwise. Generally, we
have an unbalanced panel. We can treat $\{(x_i, y_i, s_i): i = 1, 2, \ldots, N\}$ as a random
sample from the population; the selection indicators tell us which time periods are
missing for each i.

We can easily find assumptions under which the fixed effects estimator on the 
unbalanced panel is consistent by writing it as 


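$\hat\beta = \left(N^{-1}\sum_{i=1}^{N}\sum_{t=1}^{T} s_{it}\ddot{x}_{it}'\ddot{x}_{it}\right)^{-1}\left(N^{-1}\sum_{i=1}^{N}\sum_{t=1}^{T} s_{it}\ddot{x}_{it}'\ddot{y}_{it}\right)$

$\quad = \beta + \left(N^{-1}\sum_{i=1}^{N}\sum_{t=1}^{T} s_{it}\ddot{x}_{it}'\ddot{x}_{it}\right)^{-1}\left(N^{-1}\sum_{i=1}^{N}\sum_{t=1}^{T} s_{it}\ddot{x}_{it}'u_{it}\right)$, (19.96)


where we define

$\ddot{x}_{it} \equiv x_{it} - T_i^{-1}\sum_{r=1}^{T} s_{ir}x_{ir}$, $\quad \ddot{y}_{it} \equiv y_{it} - T_i^{-1}\sum_{r=1}^{T} s_{ir}y_{ir}$, $\quad$ and $\quad T_i \equiv \sum_{t=1}^{T} s_{it}$.


That is, $T_i$ is the number of time periods observed for cross section i, and we apply
the within transformation on the available time periods.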
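
The within transformation on the available time periods can be sketched as follows. This is an illustrative example only, assuming NumPy and a simulated data set; the function name `fe_unbalanced`, the array layout, and the simulation design (selection that depends on the unobserved effect but not on the idiosyncratic error) are assumptions made for the example.

```python
import numpy as np

def fe_unbalanced(y, X, s):
    """Fixed effects (within) estimator using only the observed periods.

    y: (N, T) responses; X: (N, T, K) regressors; s: (N, T) 0/1 indicators.
    Entries with s == 0 are ignored but must be finite placeholders.
    """
    s = np.asarray(s, dtype=float)
    T_i = s.sum(axis=1)                         # periods observed per unit
    T_safe = np.maximum(T_i, 1.0)               # avoid 0/0 for empty units
    y_bar = (s * y).sum(axis=1) / T_safe
    X_bar = (s[:, :, None] * X).sum(axis=1) / T_safe[:, None]
    # Demean with the unit's own observed-period averages, then zero out
    # the unobserved periods so they contribute nothing to the sums.
    y_dd = s * (y - y_bar[:, None])
    X_dd = s[:, :, None] * (X - X_bar[:, None, :])
    K = X.shape[2]
    Xs = X_dd.reshape(-1, K)
    ys = y_dd.reshape(-1)
    return np.linalg.solve(Xs.T @ Xs, Xs.T @ ys)

# Simulated example: x and selection are both correlated with c_i.
rng = np.random.default_rng(1)
N, T, beta_true = 3000, 4, 1.5
c = rng.normal(size=N)
x = c[:, None] + rng.normal(size=(N, T))
u = rng.normal(size=(N, T))
y = beta_true * x + c[:, None] + u
p_obs = 1.0 / (1.0 + np.exp(-(0.5 + c)))        # selection depends on c only
s = (rng.uniform(size=(N, T)) < p_obs[:, None]).astype(float)

beta_fe = fe_unbalanced(y, x[:, :, None], s)

# Pooled OLS (no intercept) on the observed data, for contrast.
mask = s == 1
beta_pols = (x[mask] @ y[mask]) / (x[mask] @ x[mask])
```

In this design the within estimator should land close to the true coefficient, while pooled OLS on the same observations is pulled away from it by the correlation between the regressor and the unobserved effect.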

If fixed effects on the unbalanced panel is to be consistent, we should have
$E(s_{it}\ddot{x}_{it}'u_{it}) = 0$ for all t. Now, since $\ddot{x}_{it}$ depends on all of $x_i$ and $s_i$, a form of strict
exogeneity is needed.


ASSUMPTION 19.8: (a) $E(u_{it} | x_i, s_i, c_i) = 0$, t = 1, 2, ..., T; (b) $\sum_{t=1}^{T} E(s_{it}\ddot{x}_{it}'\ddot{x}_{it})$ is
nonsingular; and (c) $E(u_iu_i' | x_i, s_i, c_i) = \sigma_u^2I_T$.


Under assumption 19.8a, $E(s_{it}\ddot{x}_{it}'u_{it}) = 0$ from the law of iterated expectations
(because $s_{it}\ddot{x}_{it}$ is a function of $(x_i, s_i)$). The second assumption is the rank condition
on the expected outer product matrix, after accounting for sample selection;
naturally, it rules out time-constant elements in $x_{it}$. These first two assumptions ensure
consistency of FE on the unbalanced panel.

In the case of a randomly rotating panel, and in other cases where selection is
entirely random, $s_i$ is independent of $(u_i, x_i, c_i)$, in which case assumption 19.8a
follows under the standard fixed effects assumption $E(u_{it} | x_i, c_i) = 0$ for all t. In this
case, the natural assumptions on the population model imply consistency and
asymptotic normality on the unbalanced panel. Assumption 19.8a also holds under much
weaker conditions. In particular, it does not assume anything about the relationship
between $s_i$ and $(x_i, c_i)$. Therefore, if we think selection in all time periods is correlated
with $c_i$ or $x_i$, but that $u_{it}$ is mean independent of $s_i$ given $(x_i, c_i)$ for all t, then FE on
the unbalanced panel is consistent and asymptotically normal. This assumption may
be a reasonable approximation, especially for short panels. What assumption 19.8a
rules out is selection that is partially correlated with the idiosyncratic errors, $u_{it}$.

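When we add assumption 19.8c, standard inference procedures based on FE are
valid. In particular, under assumptions 19.8a and 19.8c,


$\mathrm{Var}\left(\sum_{t=1}^{T} s_{it}\ddot{x}_{it}'u_{it}\right) = \sigma_u^2\left[\sum_{t=1}^{T} E(s_{it}\ddot{x}_{it}'\ddot{x}_{it})\right]$.


Therefore, the asymptotic variance of the fixed effects estimator is estimated as


$\hat\sigma_u^2\left(\sum_{i=1}^{N}\sum_{t=1}^{T} s_{it}\ddot{x}_{it}'\ddot{x}_{it}\right)^{-1}$. (19.97)


The estimator $\hat\sigma_u^2$ can be derived from


$E\left(\sum_{t=1}^{T} s_{it}\ddot{u}_{it}^2\right) = E\left[\sum_{t=1}^{T} s_{it}E(\ddot{u}_{it}^2 | s_i)\right] = E\{T_i[\sigma_u^2(1 - 1/T_i)]\} = \sigma_u^2E[(T_i - 1)]$.


Now, define the FE residuals as $\hat{u}_{it} = \ddot{y}_{it} - \ddot{x}_{it}\hat\beta$ when $s_{it} = 1$. Then, because
$N^{-1}\sum_{i=1}^{N}(T_i - 1) \overset{p}{\to} E(T_i - 1)$,


$\hat\sigma_u^2 = \left[N^{-1}\sum_{i=1}^{N}(T_i - 1)\right]^{-1}N^{-1}\sum_{i=1}^{N}\sum_{t=1}^{T} s_{it}\hat{u}_{it}^2 = \left[\sum_{i=1}^{N}(T_i - 1)\right]^{-1}\sum_{i=1}^{N}\sum_{t=1}^{T} s_{it}\hat{u}_{it}^2$


is consistent for $\sigma_u^2$ as $N \to \infty$. Standard software packages also make a
degrees-of-freedom adjustment by subtracting K from $\sum_{i=1}^{N}(T_i - 1)$. It follows that all of the
usual test statistics based on an unbalanced fixed effects analysis are valid. In
particular, the dummy variable regression discussed in Chapter 10 produces asymptotically
valid statistics.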
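
The factor $(1 - 1/T_i)$ used in deriving the variance estimator can be verified directly. The short derivation below is a worked check, assuming only assumptions 19.8a and 19.8c and the definition of the time-demeaned error on the observed periods, with $T_i$ the number of observed periods for unit i:

```latex
\begin{aligned}
\ddot{u}_{it} &= u_{it} - T_i^{-1}\sum_{r=1}^{T} s_{ir}u_{ir}
  \qquad \text{(for } s_{it} = 1\text{)},\\
E(\ddot{u}_{it}^2 \mid x_i, s_i, c_i)
  &= \sigma_u^2 - 2T_i^{-1}\sigma_u^2 + T_i^{-2}\,T_i\,\sigma_u^2
   = \sigma_u^2(1 - 1/T_i),\\
\sum_{t=1}^{T} s_{it}\,E(\ddot{u}_{it}^2 \mid x_i, s_i, c_i)
  &= T_i\,\sigma_u^2(1 - 1/T_i) = \sigma_u^2(T_i - 1).
\end{aligned}
```

The cross term uses the fact that, among the observed periods, only $r = t$ has nonzero conditional covariance with $u_{it}$; the squared-average term sums $T_i$ variances. A unit with $T_i = 1$ therefore contributes nothing, which matches its dropping out of the within estimator.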

Because the FE estimator uses time demeaning, any unit i for which $T_i = 1$ drops
out of the fixed effects estimator. To use these observations we would need to add
more assumptions, such as the random effects assumption $E(c_i | x_i, s_i) = 0$.

Relaxing assumption 19.8c is easy: just apply the robust variance matrix estimator
in equation (10.59) to the unbalanced panel. The only changes are that the rows of $\ddot{X}_i$
are $s_{it}\ddot{x}_{it}$ and the elements of $\hat{u}_i$ are $s_{it}\hat{u}_{it}$, t = 1, ..., T.

Under assumption 19.8, it is also valid to use a standard fixed effects analysis on
any balanced subset of the unbalanced panel; in fact, we can condition on any
outcomes of the $s_{it}$. For example, if we use unit i only when observations are available in
all time periods, we are conditioning on $s_{it} = 1$ for all t. Comparing FE on the
unbalanced panel with FE on a balanced panel is a sensible specification check.

Using similar arguments, it can be shown that any kind of differencing method on 
any subset of the observed panel is consistent. For example, with T = 3, we observe 
cross section units with data for one, two, or three time periods. Those units with 
T; = 1 drop out, but any other combinations of differences can be used in a pooled 
OLS analysis. The analogues of assumption 19.8 for first differencing (for example,
assumption 19.8c is replaced with $E(\Delta u_i\Delta u_i' | x_i, s_i, c_i) = \sigma_e^2I_{T-1}$) ensure that the
usual statistics from pooled OLS on the unbalanced first differences are
asymptotically valid.

Random effects estimation on an unbalanced panel is somewhat more complicated 
because of the GLS transformation, and its consistency hinges on stronger assump- 
tions concerning the selection mechanism. In addition to a rank condition, the key 
extra requirement (compared with FE) is 


$E(c_i | x_i, s_i) = E(c_i)$, (19.98)


so that the heterogeneity is mean independent of the covariates and selection in all
time periods. Assumption (19.98) is generally very restrictive. It does allow for units
being randomly rotated in and out of a panel or for rotation to depend on the
observed factors in $x_i$.

A careful analysis of GLS estimation with unbalanced panels is notationally
cumbersome when we explicitly introduce the selection indicators. Nevertheless, the
random effects estimator is easy to describe because of the "exchangeable" nature
of the random effects covariance structure. In particular, under
assumption RE.3 from Section 10.4 (for the balanced case), the covariance between
any two time periods is the same. Along with the assumption of constant variance
over time, this leads to a straightforward GLS transformation. In the balanced
case, the data are quasi-time-demeaned using (an estimate of) the parameter
$\lambda = 1 - [\sigma_u^2/(\sigma_u^2 + T\sigma_c^2)]^{1/2}$; see Section 10.7.2. For the unbalanced case, where selection
is exogenous in the sense of assumption (19.98), quasi-time-demeaning applies, but the
parameter depends on i, namely,


$\lambda_i = 1 - [\sigma_u^2/(\sigma_u^2 + T_i\sigma_c^2)]^{1/2}$, (19.99)


where $T_i = \sum_{r=1}^{T} s_{ir}$ is the number of time periods available for unit i. Unlike in the
balanced case, $\lambda_i$ is properly viewed as a random variable because it is a function
of $T_i$. But, under assumption 19.8a and assumption (19.98), $T_i$ is appropriately
exogenous because $E(v_{it} | x_i, T_i) = 0$, where $v_{it} = c_i + u_{it}$. Therefore, transforming the
equation using any function of $T_i$ still leads to consistent estimation under
assumption (19.98). See Baltagi (2001, Section 9.2) for the case where $T_i$ is treated as
nonrandom.

If we knew $\lambda_i$, the RE estimator on the unbalanced panel would just be pooled
OLS on the quasi-time-demeaned data $\check{y}_{it} = y_{it} - \lambda_i\bar{y}_i$ and $\check{x}_{it} = x_{it} - \lambda_i\bar{x}_i$, where,
naturally, the time averages are of the observed data. To operationalize the RE
estimator, we need initial estimators of $\sigma_c^2$ and $\sigma_u^2$, but these are obtained by modifying
the estimators from the balanced case.
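
To make the unit-specific transformation concrete, the following minimal sketch computes the quasi-time-demeaning parameter in equation (19.99) and applies it to one unit's observed values. It assumes the two variance components are already known or estimated; the function names `re_lambda` and `quasi_demean` are hypothetical.

```python
import math

def re_lambda(T_i, sigma2_u, sigma2_c):
    """Unit-specific parameter: 1 - sqrt(sigma2_u / (sigma2_u + T_i * sigma2_c))."""
    if T_i < 1:
        raise ValueError("unit must have at least one observed period")
    return 1.0 - math.sqrt(sigma2_u / (sigma2_u + T_i * sigma2_c))

def quasi_demean(values, sigma2_u, sigma2_c):
    """Quasi-time-demean the observed values of a single unit.

    `values` holds only the periods actually observed for that unit, so
    len(values) plays the role of T_i and the mean is over observed data.
    """
    lam = re_lambda(len(values), sigma2_u, sigma2_c)
    mean = sum(values) / len(values)
    return [v - lam * mean for v in values]

lam_full = re_lambda(4, 1.0, 1.0)    # unit observed in 4 periods
lam_short = re_lambda(2, 1.0, 1.0)   # unit observed in only 2 periods
y_check = quasi_demean([1.0, 3.0], 1.0, 0.0)  # no heterogeneity: data unchanged
```

Units observed in more periods get a parameter closer to one (closer to full time demeaning), and when the variance of the unobserved effect is zero the transformation leaves the data unchanged, which corresponds to pooled OLS.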

With modern software, implementing random effects on an unbalanced panel is 
straightforward. But one should know that the RE estimator imposes strong as- 
sumptions on the nature of the missing data mechanism. Further, as in the balanced 
case, there are good reasons to believe the RE variance structure should not be taken 
literally. That is, even if we assume, in addition to assumption 19.8a, assumption 
19.8c, and assumption (19.98), 


$\mathrm{Var}(c_i | x_i, s_i) = \mathrm{Var}(c_i) = \sigma_c^2$, (19.100)


it is still likely that the composite variance covariance matrix in the population,
$\mathrm{Var}(v_i | x_i)$, does not have the RE structure. We already know in the balanced case
that using the incorrect variance matrix in GLS estimation does not cause
inconsistency in estimating $\beta$, provided the explanatory variables are strictly exogenous with
respect to $v_{it} = c_i + u_{it}$. Not surprisingly, in the unbalanced case we effectively must
add strict exogeneity of the selection process. Under this assumption, RE on the
unbalanced panel is generally consistent and $\sqrt{N}$-asymptotically normal for any
variance structure for $\mathrm{Var}(v_i | x_i, s_i)$. But we need to compute a fully robust variance
matrix estimator for $\hat\beta_{RE}$, which is typically available in modern econometrics
packages for the balanced and unbalanced cases.


19.9.2 Testing and Correcting for Sample Selection Bias 


The results in the previous subsection imply that sample selection in a fixed effects
context is only a problem when selection is related to the idiosyncratic errors, $u_{it}$.
Therefore, any test for selection bias should test only this assumption. A simple test
was suggested by Nijman and Verbeek (1992) in the context of random effects
estimation, but it works for fixed effects as well: add, say, the lagged selection indicator,
$s_{i,t-1}$, to the equation, estimate the model by fixed effects (on the unbalanced panel),
and do a t test (perhaps making it fully robust) for the significance of $s_{i,t-1}$. (This
method loses the first time period for all observations.) Under the null hypothesis, $u_{it}$
is uncorrelated with $s_{ir}$ for all r, and so selection in the previous time period should
not be significant in the equation at time t. (Incidentally, it never makes sense to put
$s_{it}$ in the equation at time t because $s_{it} = 1$ for all i and t in the selected subsample.)

Putting $s_{i,t-1}$ does not work if $s_{i,t-1}$ is unity whenever $s_{it}$ is unity because then there
is no variation in $s_{i,t-1}$ in the selected sample. This is the case in attrition problems if
(say) a person can only appear in period t if he or she appeared in t − 1. An
alternative is to include a lead of the selection indicator, $s_{i,t+1}$. For observations i that are in
the sample every time period, $s_{i,t+1}$ is always one. But for attriters, $s_{i,t+1}$ switches
from one to zero in the period just before attrition. Alternatively, we can define a
variable at t, say $T_{i,t+1}$, which is the number of periods after period t that unit i is in
the sample. Either way, if we use fixed effects or first differencing, we need T > 2 time
periods to carry out the test.

For RE we have other possibilities, such as adding $T_i$ as an additional regressor
and using a t test.
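
The candidate test regressors discussed here are simple functions of a unit's selection history. The sketch below only constructs them for one unit; it does not run the fixed effects regression or the t test. The function name `selection_test_regressors` and the use of `None` for periods where a lag or lead is undefined are choices made for this illustration.

```python
def selection_test_regressors(s_i):
    """Build lag, lead, and later-period counts from one unit's indicators.

    s_i is a list of 0/1 selection indicators for t = 1, ..., T, with 1
    meaning the unit is observed in that period.
    """
    T = len(s_i)
    lag = [None] + list(s_i[:-1])        # selection indicator at t - 1
    lead = list(s_i[1:]) + [None]        # selection indicator at t + 1
    later = [sum(s_i[t + 1:]) for t in range(T)]  # observed periods after t
    return lag, lead, later

# A unit observed in the first three of five periods, then lost to attrition.
lag, lead, later = selection_test_regressors([1, 1, 1, 0, 0])
```

Only the rows with a selection indicator of one enter the estimation, so for this unit the lag is constant across its usable periods while the lead and the count of later periods both vary, which is why they remain usable under attrition.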

For incidental truncation problems it makes sense to extend Heckman’s (1976) test
to the unobserved effects panel data context. This is done in Wooldridge (1995a).
Write the equation of interest as


$y_{it1} = x_{it1}\beta_1 + c_{i1} + u_{it1}, \quad t = 1, \ldots, T$ (19.101)


Initially, suppose that $y_{it1}$ is observed only if the binary selection indicator, $s_{it2}$, is
unity. Let $x_{it}$ denote the set of all exogenous variables at time t; we assume that these
are observed in every time period, and $x_{it1}$ is a subset of $x_{it}$. Suppose that, for each t,
$s_{it2}$ is determined by the probit equation


$s_{it2} = 1[x_i\psi_{t2} + v_{it2} > 0], \quad v_{it2} | x_i \sim \mathrm{Normal}(0, 1)$, (19.102)


where $x_i$ contains unity. This is best viewed as a reduced-form selection equation: we
let the explanatory variables in all time periods appear in the selection equation at
time t to allow for general selection models, including those with unobserved effect
and the Chamberlain (1980) device discussed in Section 15.8.2, as well as certain
dynamic models of selection. A Mundlak (1978) approach would replace $x_i$ with
$(x_{it}, \bar{x}_i)$ at time t and assume that coefficients are constant across time. (See equation
(15.68).) Then the parameters can be estimated by pooled probit, greatly conserving
on degrees of freedom. Such conservation may be important for small N. For testing
purposes, under the null hypothesis it does not matter whether equation (19.102) is
the proper model of sample selection, but we will need to assume equation (19.102),
or a Mundlak version of it, when correcting for sample selection.

Under the null hypothesis in assumption 19.8a (with the obvious notational
changes), the inverse Mills ratio obtained from the sample selection probit should not
be significant in the equation estimated by fixed effects. Thus, let $\hat\lambda_{it2}$ be the estimated
Mills ratios from estimating equation (19.102) by pooled probit across i and t. Then
a valid test of the null hypothesis is a t statistic on $\hat\lambda_{it2}$ in the FE estimation on the
unbalanced panel. Under assumption 19.8c the usual t statistic is valid, but the
approach works whether or not the $u_{it1}$ are homoskedastic and serially uncorrelated:
just compute the robust standard error. Wooldridge (1995a) shows formally that
the first-stage estimation of $\psi_2$ does not affect the limiting distribution of the t
statistic under $H_0$. This conclusion also follows from the results in Chapter 12 on
M-estimation.

Correcting for sample selection requires much more care. Unfortunately, under
any assumptions that actually allow for an unobserved effect in the underlying
selection equation, adding $\hat\lambda_{it2}$ to equation (19.101) and using FE does not produce
consistent estimators. To see why, suppose


$s_{it2} = 1[x_{it}\delta_2 + c_{i2} + a_{it2} > 0], \quad a_{it2} | (x_i, c_{i1}, c_{i2}) \sim \mathrm{Normal}(0, 1)$. (19.103)


Then, to get equation (19.102), $v_{it2}$ depends on $a_{it2}$ and, at least partially, on $c_{i2}$.
Now, suppose we make the strong assumption $E(u_{it1} | x_i, c_{i1}, c_{i2}, v_{i2}) = g_{i1} + \rho_1v_{it2}$,
which would hold under the assumption that the $(u_{it1}, a_{it2})$ are independent across t
conditional on $(x_i, c_{i1}, c_{i2})$. Then we have


$y_{it1} = x_{it1}\beta_1 + \rho_1E(v_{it2} | x_i, s_{i2}) + (c_{i1} + g_{i1}) + e_{it1} + \rho_1[v_{it2} - E(v_{it2} | x_i, s_{i2})]$.


The composite error, $e_{it1} + \rho_1[v_{it2} - E(v_{it2} | x_i, s_{i2})]$, is uncorrelated with any function
of $(x_i, s_{i2})$. The problem is that $E(v_{it2} | x_i, s_{i2})$ depends on all elements in $s_{i2}$, and this
expectation is complicated for even small T.

A method that does work is available using Chamberlain’s approach to panel data
models, but we need some linearity assumptions on the expected values of $u_{it1}$ and $c_{i1}$
given $x_i$ and $v_{it2}$.


ASSUMPTION 19.9: (a) The selection equation is given by equation (19.102);
(b) $E(u_{it1} | x_i, v_{it2}) = E(u_{it1} | v_{it2}) = \rho_{t1}v_{it2}$, t = 1, ..., T; and (c) $E(c_{i1} | x_i, v_{it2}) =
L(c_{i1} | 1, x_i, v_{it2})$.


The second assumption is standard and follows under joint normality of $(u_{it1}, v_{it2})$
when this vector is independent of $x_i$. Assumption 19.9c implies that


$E(c_{i1} | x_i, v_{it2}) = x_i\pi_1 + \phi_{t1}v_{it2}$,

where, by equation (19.102) and iterated expectations, $E(c_{i1} | x_i) = x_i\pi_1 + \phi_{t1}E(v_{it2} | x_i)
= x_i\pi_1$. These assumptions place no restrictions on the serial dependence in $(u_{it1}, v_{it2})$.
They do imply that


$E(y_{it1} | x_i, v_{it2}) = x_{it1}\beta_1 + x_i\pi_1 + \gamma_{t1}v_{it2}$. (19.104)

Conditioning on $s_{it2} = 1$ gives

$E(y_{it1} | x_i, s_{it2} = 1) = x_{it1}\beta_1 + x_i\pi_1 + \gamma_{t1}\lambda(x_i\psi_{t2})$.


Therefore, we can consistently estimate $\beta_1$ by first estimating a probit of $s_{it2}$ on $x_i$ for
each t and then saving the inverse Mills ratio, $\hat\lambda_{it2}$, all i and t. Next, run the pooled
OLS regression using the selected sample:


$y_{it1}$ on $x_{it1}, x_i, \hat\lambda_{it2}, d2_t\hat\lambda_{it2}, \ldots, dT_t\hat\lambda_{it2}$ for all $s_{it2} = 1$, (19.105)


where $d2_t$ through $dT_t$ are time dummies. If $\gamma_{t1}$ in equation (19.104) is constant across
t, simply include $\hat\lambda_{it2}$ by itself in equation (19.105). Again, a simplification that can be
practically useful is to replace $x_i$ with the time average, $\bar{x}_i$.
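
The generated regressors in the pooled regression are built from the inverse Mills ratio evaluated at the fitted probit index. The sketch below shows that building block using only the Python standard library. It assumes the fitted index for each observation comes from a probit estimated elsewhere; the function names `inverse_mills` and `mills_regressors` are hypothetical.

```python
import math

def inverse_mills(z):
    """Inverse Mills ratio: standard normal pdf(z) / cdf(z).

    Uses erfc for the cdf; not intended for extremely negative z,
    where the cdf underflows to zero.
    """
    pdf = math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)
    cdf = 0.5 * math.erfc(-z / math.sqrt(2.0))
    return pdf / cdf

def mills_regressors(lam_hat, t, T):
    """Return [lam, d2*lam, ..., dT*lam] for an observation in period t.

    Periods are numbered 1, ..., T; the dummy for period r equals one
    only when t == r, so at most one interaction is nonzero.
    """
    return [lam_hat] + [lam_hat if t == r else 0.0 for r in range(2, T + 1)]

lam0 = inverse_mills(0.0)                 # ratio at a fitted index of zero
row = mills_regressors(0.5, t=3, T=4)     # period-3 observation, T = 4
```

When the coefficient on the Mills ratio is taken to be constant over time, only the first element of each row would be kept.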

The asymptotic variance of $\hat\beta_1$ needs to be corrected for general heteroskedasticity
and serial correlation, as well as first-stage estimation of the $\psi_{t2}$. These corrections
can be made using the formulas for two-step M-estimation from Chapter 12;
Wooldridge (1995a) contains the formulas. Alternatively, the panel bootstrap can be
used, where the resampling is done using the cross section units.

If the selection equation is of the Tobit form, we have somewhat more flexibility.
Write the selection equation now as


$y_{it2} = \max(0, x_i\psi_{t2} + v_{it2}), \quad v_{it2} | x_i \sim \mathrm{Normal}(0, \sigma_{t2}^2)$, (19.106)


where $y_{it1}$ is observed if $y_{it2} > 0$. Then, under assumption 19.9, with the Tobit
selection equation in place of equation (19.102), consistent estimation follows from
the pooled regression (19.105) where $\hat\lambda_{it2}$ is replaced by the Tobit residuals, $\hat{v}_{it2}$, when
$y_{it2} > 0$ ($s_{it2} = 1$). The Tobit residuals are obtained from the T cross section Tobits in
equation (19.106); alternatively, especially with small N, we can use a Mundlak-type
approach and use pooled Tobit with $x_i\psi_{t2}$ replaced with $x_{it}\delta_2 + \bar{x}_i\pi_2$; see equation
(17.82).

It is easy to see that we can add $\alpha_1y_{it2}$ to the structural equation (19.101), provided
we make an explicit exclusion restriction in assumption 19.9. In particular, we
must assume that $E(c_{i1} | x_i, v_{it2}) = x_{i1}\pi_1 + \phi_{t1}v_{it2}$, and that $x_{it1}$ is a strict subset of $x_{it}$.
Then, because $y_{it2}$ is a function of $(x_i, v_{it2})$, we can write $E(y_{it1} | x_i, v_{it2}) = x_{it1}\beta_1 +
\alpha_1y_{it2} + x_{i1}\pi_1 + \gamma_{t1}v_{it2}$. We obtain the Tobit residuals, $\hat{v}_{it2}$, for each t, and then run the
regression $y_{it1}$ on $x_{it1}$, $y_{it2}$, $x_{i1}$, and $\hat{v}_{it2}$ (possibly interacted with time dummies) for
the selected sample. If we do not have an exclusion restriction, this regression suffers
from perfect multicollinearity. As an example, we can easily include hours worked in
a wage offer function for panel data, provided we have a variable affecting labor
supply (such as the number of young children) but not the wage offer.

A pure fixed effects approach is more fruitful when the selection equation is of the
Tobit form. The following assumption comes from Wooldridge (1995a):


ASSUMPTION 19.10: (a) The selection equation is equation (19.106). (b) For some
unobserved effect $g_{i1}$, $E(u_{it1} | x_i, c_{i1}, g_{i1}, v_{i2}) = E(u_{it1} | g_{i1}, v_{it2}) = g_{i1} + \rho_1v_{it2}$.


Under part b of this assumption,

$E(y_{it1} | x_i, v_{i2}, c_{i1}, g_{i1}) = x_{it1}\beta_1 + \rho_1v_{it2} + f_{i1}$, (19.107)


where $f_{i1} = c_{i1} + g_{i1}$. The same expectation holds when we also condition on $s_{i2}$
(since $s_{i2}$ is a function of $x_i$, $v_{i2}$). Therefore, estimating equation (19.107) by fixed
effects on the unbalanced panel would consistently estimate $\beta_1$ and $\rho_1$. As usual, we
replace $v_{it2}$ with the Tobit residuals $\hat{v}_{it2}$ whenever $y_{it2} > 0$. A t test of $H_0$: $\rho_1 = 0$ is
valid very generally as a test of the null hypothesis of no sample selection. If the $\{u_{it1}\}$
satisfy the standard homoskedasticity and serial uncorrelatedness assumptions, then
the usual t statistic is valid. A fully robust test may be warranted. (Again, with an
exclusion restriction, we can add $y_{it2}$ as an additional explanatory variable.)

Wooldridge (1995a) discusses an important case where assumption 19.10b holds:
in the Tobit version of equation (19.103) with $(u_{i1}, a_{i2})$ independent of $(x_i, c_{i1}, c_{i2})$
and $E(u_{it1} | a_{i2}) = E(u_{it1} | a_{it2}) = \rho_1a_{it2}$. The second-to-last equality holds under the
common assumption that $\{(u_{it1}, a_{it2}): t = 1, \ldots, T\}$ is serially independent.

The preceding methods assume normality of the errors in the selection equation 
and, implicitly, the unobserved heterogeneity. Kyriazidou (1997) and Honoré and 
Kyriazidou (2000b) have proposed methods that do not require distributional 
assumptions. Dustmann and Rochina-Barrachina (2007) apply Wooldridge’s (1995a) 
and Kyriazidou’s (1997) methods to the problem of estimating a wage offer equation 
with selection into the work force. 

Semykina and Wooldridge (in press) show how to extend sample selection correc- 
tions for panel data to the case of endogenous explanatory variables. As in the case of 
cross section corrections, the endogenous explanatory variables need not be observed 
in all time periods. As shown formally by Semykina and Wooldridge (in press), a 
simple test for sample selection bias—without taking a stand on endogeneity of the 
explanatory variables—is obtained by adding inverse Mills ratio terms to the
unbalanced panel and using fixed effects 2SLS. Corrections are available by adding time
averages of the exogenous variables (both exogenous explanatory variables and the 
instruments) and applying pooled 2SLS with inverse Mills ratio terms (which, as usual, 
are estimated in a first stage). See Semykina and Wooldridge (in press) for details. 


19.9.3 Attrition 


We now turn specifically to testing and correcting for attrition in a linear, unobserved 
effects panel data model. General attrition, where units may reenter the sample after 
leaving, is complicated. We analyze a common special case. At t = 1 a random
sample is obtained from the relevant population—people, for concreteness. In t = 2 
and beyond, some people drop out of the sample for reasons that may not be entirely 
random. We assume that, once a person drops out, he or she is out forever: attrition 
is an absorbing state. Any panel data set with attrition can be set up in this way by 
ignoring any subsequent observations on units after they initially leave the sample. In 
Section 19.9.2 we discussed one way to test for attrition bias when we assume that 
attrition is an absorbing state: include $s_{i,t+1}$ as an additional explanatory variable in a
fixed effects analysis, or use the number of subsequent periods in the sample. 

One method for correcting for attrition bias is closely related to the corrections for 
incidental truncation covered in the previous subsection. Write the model for a
random draw from the population as in equation (19.95), where we assume that $(x_{it}, y_{it})$
is observed for all i when t = 1. Let $s_{it}$ denote the selection indicator for each time
period, where $s_{it} = 1$ if $(x_{it}, y_{it})$ are observed. Because we ignore units once they
initially leave the sample, $s_{it} = 1$ implies $s_{ir} = 1$ for r < t.
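
Setting up a panel so that attrition is an absorbing state amounts to discarding every observation on a unit after its first missing period. A minimal sketch of that step for one unit's selection history follows; the function name `make_absorbing` is hypothetical, and the input is assumed to be a list of 0/1 indicators ordered by time.

```python
def make_absorbing(s_i):
    """Zero out all selection indicators after a unit first leaves the sample."""
    out = []
    in_sample = 1
    for s in s_i:
        in_sample = in_sample * int(s)   # stays 0 once a period is missed
        out.append(in_sample)
    return out

# A unit that leaves in period 3 and reappears later is treated as gone for good.
absorbed = make_absorbing([1, 1, 0, 1, 1])
never_left = make_absorbing([1, 1, 1])
```

After this step, a value of one in any period implies a value of one in every earlier period, which is the property the sequential selection equations rely on.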

The sequential nature of attrition makes first differencing a natural choice to
remove the unobserved effect:


$\Delta y_{it} = \Delta x_{it}\beta + \Delta u_{it}, \quad t = 2, \ldots, T$. (19.108)

Conditional on $s_{i,t-1} = 1$, write a (reduced-form) selection equation for $t \geq 2$ as

$s_{it} = 1[w_{it}\delta_t + v_{it} > 0], \quad v_{it} | \{\Delta x_{it}, w_{it}, s_{i,t-1} = 1\} \sim \mathrm{Normal}(0, 1)$, (19.109)


where $w_{it}$ must contain variables observed at time t for all units with $s_{i,t-1} = 1$. Good
candidates for $w_{it}$ include the variables in $x_{i,t-1}$ and any variables in $x_{it}$ that are
observed at time t when $s_{i,t-1} = 1$ (for example, if $x_{it}$ contains lags of variables or a
variable such as age). In general, the dimension of $w_{it}$ can grow with t. For example,
if equation (19.95) is dynamically complete, then $y_{i,t-2}$ is orthogonal to $\Delta u_{it}$, and so
it can be an element of $w_{it}$. Since $y_{i,t-1}$ is correlated with $u_{i,t-1}$, it should not be
included in $w_{it}$.

Under assumption (19.109), selection in time t, conditional on being in the sample
at time t − 1 ($s_{i,t-1} = 1$), follows a probit model:


$P(s_{it} = 1 | w_{it}, s_{i,t-1} = 1) = \Phi(w_{it}\delta_t), \quad t = 2, \ldots, T$. (19.110)


Therefore, we can estimate $\delta_t$, t = 2, ..., T, by a sequence of probits where, in each
time period, one uses the units still in the sample in the previous time period.

In order to justify a two-step Heckman correction, we need to assume that, for
those still in the sample at t − 1, $x_{it}$ does not affect selection in time t once we
condition on $w_{it}$. A natural way to state this condition is


$P(s_{it} = 1 | \Delta x_{it}, w_{it}, s_{i,t-1} = 1) = P(s_{it} = 1 | w_{it}, s_{i,t-1} = 1)$. (19.111)


Technically, assumption (19.111) is not strong enough. An assumption that is
sufficient (stated in terms of the errors in the first-differenced equation) is that,
conditional on $s_{i,t-1} = 1$,


$(\Delta u_{it}, v_{it})$ is independent of $(\Delta x_{it}, w_{it})$, (19.112)


which, of course, implies that $v_{it}$ is independent of $(\Delta x_{it}, w_{it})$ and implies condition
(19.111) when $v_{it}$ has a standard normal distribution. We also impose the standard
linear functional form assumption:


$E(\Delta u_{it} | v_{it}, s_{i,t-1} = 1) = \rho_tv_{it}$, (19.113)

in which case

$E(\Delta y_{it} | \Delta x_{it}, w_{it}, s_{it} = 1) = \Delta x_{it}\beta + \rho_t\lambda(w_{it}\delta_t), \quad t = 2, \ldots, T$, (19.114)


where $\lambda(w_{it}\delta_t)$ is the inverse Mills ratio.

Notice how, because $s_{i,t-1} = 1$ when $s_{it} = 1$, we do not have to condition on $s_{i,t-1}$
in equation (19.114). It now follows from equation (19.114) that pooled OLS of $\Delta y_{it}$
on $\Delta x_{it}, d2_t\hat\lambda_{it}, \ldots, dT_t\hat\lambda_{it}$, t = 2, ..., T, where the $\hat\lambda_{it}$ are from the T − 1 cross section
probits in equation (19.110), is consistent for $\beta$ and the $\rho_t$. A joint test of $H_0$: $\rho_t = 0$,
t = 2, ..., T, is a fairly simple test for attrition bias, although nothing guarantees
serial independence of the errors.

There are two potential problems with this approach. For one, assumption (19.112) is restrictive because it means that $\mathbf{x}_{it}$ does not affect attrition once the elements in $\mathbf{w}_{it}$ have been controlled for. This is a very strong assumption. In fact, in one scenario where this assumption fails, it is actually harmful to apply the Heckman approach. To see how this result can occur, suppose that selection is actually a function of changes in the covariates in the sense that

$$P(s_{it} = 1 \mid \Delta\mathbf{x}_{it}, \Delta u_{it}) = P(s_{it} = 1 \mid \Delta\mathbf{x}_{it}). \tag{19.115}$$

Under this condition, pooled OLS estimation of equation (19.108), using the unbalanced panel, is consistent for $\boldsymbol{\beta}$. By contrast, the Heckman method just described is generally inconsistent if $\Delta\mathbf{x}_{it}$ is not included in $\mathbf{w}_{it}$: it is very unlikely that

$$E(\Delta u_{it} \mid \Delta\mathbf{x}_{it}, \mathbf{w}_{it}, v_{it}, s_{i,t-1} = 1) = E(\Delta u_{it} \mid v_{it}, s_{i,t-1} = 1)$$

(because $v_{it}$ generally cannot be assumed to be independent of $\Delta\mathbf{x}_{it}$ if $\Delta\mathbf{x}_{it}$ is not included in $\mathbf{w}_{it}$).

A second shortcoming of the Heckman procedure just described is that it requires strict exogeneity of $\{\mathbf{x}_{it}\}$ in the original (levels) equation. Fortunately, in some cases we can apply an instrumental variables approach that is consistent when $\{\mathbf{x}_{it}\}$ is not strictly exogenous or always observed at time $t$.

Let $\mathbf{z}_{it}$ be a vector of variables such that $\mathbf{z}_{it}$ is redundant in the selection equation (possibly because $\mathbf{w}_{it}$ contains $\mathbf{z}_{it}$) and that $\mathbf{z}_{it}$ is exogenous in the sense that equation (19.112) holds with $\mathbf{z}_{it}$ in place of $\Delta\mathbf{x}_{it}$; for example, $\mathbf{z}_{it}$ should contain $\mathbf{x}_{ir}$ for $r < t$. Now, using an argument similar to the cross section case in Section 19.6.2, we can estimate the equation

$$\Delta y_{it} = \Delta\mathbf{x}_{it}\boldsymbol{\beta} + \rho_2\, d2_t\hat{\lambda}_{it} + \cdots + \rho_T\, dT_t\hat{\lambda}_{it} + \mathit{error}_{it} \tag{19.116}$$

by instrumental variables with instruments $(\mathbf{z}_{it}, d2_t\hat{\lambda}_{it}, \ldots, dT_t\hat{\lambda}_{it})$, using the selected sample. For example, the pooled 2SLS estimator on the selected sample is consistent and asymptotically normal, and attrition bias can be tested by a joint test of $H_0\colon \rho_t = 0$, $t = 2, \ldots, T$. Under $H_0$, only serial correlation and heteroskedasticity adjustments are possibly needed. If $H_0$ fails we have the usual generated regressors problem
for estimating the asymptotic variance. Other IV procedures, such as GMM, can also 
be used, but they too must account for the generated regressors problem. The panel 
bootstrap is a simple alternative to the analytical formulas. 
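The panel bootstrap resamples cross section units, not individual unit-period observations, so each unit's full history (including its attrition pattern) stays together. The sketch below illustrates only that resampling scheme. The data layout (a list of per-unit lists, with `None` marking periods lost to attrition) and the toy estimator `mean_observed_change` are hypothetical stand-ins; in an actual application the estimator passed in would rerun both the selection probits and the second-step IV estimation on every bootstrap sample.

```python
import random


def panel_bootstrap_sample(panel, rng):
    """Draw N units with replacement, keeping each unit's time series intact."""
    n = len(panel)
    return [panel[rng.randrange(n)] for _ in range(n)]


def panel_bootstrap_se(panel, estimator, n_reps=200, seed=0):
    """Bootstrap standard error of a scalar estimator computed on a panel."""
    rng = random.Random(seed)
    draws = [estimator(panel_bootstrap_sample(panel, rng)) for _ in range(n_reps)]
    mean = sum(draws) / n_reps
    var = sum((d - mean) ** 2 for d in draws) / (n_reps - 1)
    return var ** 0.5


def mean_observed_change(panel):
    """Toy stand-in for a two-step estimator: the average first difference
    over unit-periods where both y_t and y_{t-1} are observed."""
    diffs = [u[t] - u[t - 1]
             for u in panel
             for t in range(1, len(u))
             if u[t] is not None and u[t - 1] is not None]
    return sum(diffs) / len(diffs)


# Hypothetical outcomes for 5 units over 3 waves; None marks attrition.
panel = [
    [1.0, 1.5, 2.5],
    [0.0, 0.4, None],
    [2.0, 2.2, 2.9],
    [1.0, 1.1, None],
    [0.5, 1.5, 2.0],
]
se = panel_bootstrap_se(panel, mean_observed_change)
print("bootstrap standard error:", se)
```

Because every step is repeated inside each replication, the resulting standard error reflects the sampling variation from the first-stage estimates without any analytical generated-regressor correction.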


Example 19.9 (Dynamic Model with Attrition): Consider the model

$$y_{it} = \mathbf{g}_{it}\boldsymbol{\gamma} + \eta_1 y_{i,t-1} + c_i + u_{it}, \quad t = 1, \ldots, T, \tag{19.117}$$

where we assume that $(y_{i0}, \mathbf{g}_{i1}, y_{i1})$ are all observed for a random sample from the population. Assume that $E(u_{it} \mid \mathbf{g}_i, y_{i,t-1}, \ldots, y_{i0}, c_i) = 0$, so that $\mathbf{g}_{it}$ is strictly exogenous. Then the explanatory variables in the probit at time $t$, $\mathbf{w}_{it}$, can include $\mathbf{g}_{i,t-1}$, $y_{i,t-2}$, and further lags of these. After estimating the selection probit for each $t$, and differencing, we can estimate

$$\Delta y_{it} = \Delta\mathbf{g}_{it}\boldsymbol{\gamma} + \eta_1\Delta y_{i,t-1} + \rho_3\, d3_t\hat{\lambda}_{it} + \cdots + \rho_T\, dT_t\hat{\lambda}_{it} + \mathit{error}_{it}$$

by pooled 2SLS on the selected sample starting at $t = 3$, using instruments $(\mathbf{g}_{i,t-1}, \mathbf{g}_{i,t-2}, y_{i,t-2}, y_{i,t-3})$. As usual, there are other possibilities for the instruments.


Although the focus in this section has been on pure attrition, where units disappear entirely from the sample, the methods can also be used in the context of incidental truncation without strictly exogenous explanatory variables. For example, suppose we are interested in the population of men who are employed at $t = 0$ and $t = 1$, and we would like to estimate a dynamic wage equation with an unobserved effect. Problems arise if men become unemployed in future periods. Such events can be treated as an attrition problem if all subsequent time periods are dropped once a man first becomes unemployed. This approach loses information but makes the econometrics relatively straightforward, especially because, in the preceding general model, $\mathbf{x}_{it}$ will always be observed at time $t$ and so can be included in the labor force participation probit (assuming that men do not leave the sample entirely). Things become much more complicated if we are interested in the wage offer for all working age men at $t = 1$ because we have to deal with the sample selection problem into employment at $t = 0$ and $t = 1$.

The methods for attrition and selection just described apply only to linear models, 
and it is difficult to extend them to general nonlinear models. An alternative ap- 
proach is based on inverse probability weighting (IPW), which we described in the 
cross section case in Section 19.8. 

Moffitt, Fitzgerald, and Gottschalk (1999) (MFG) propose IPW to estimate linear 
panel data models under possibly nonrandom attrition. (MFG propose a different 
set of weights, analogous to those studied by Horowitz and Manski (1998), to solve 
missing data problems. The weights we use require estimation of only one attrition 
model, rather than two as in MFG.) IPW must be used with care to solve the attrition problem. As before, we assume that we have a random sample from the population at $t = 1$. We are interested in some feature, such as the conditional mean, or maybe the entire conditional distribution, of $y_{it}$ given $\mathbf{x}_{it}$. Ideally, at each $t$ we would observe $(y_{it}, \mathbf{x}_{it})$ for any unit that was in the random sample at $t = 1$. Instead, we observe $(y_{it}, \mathbf{x}_{it})$ only if $s_{it} = 1$. We can easily solve the attrition problem if we assume that, conditional on observables in the first time period, say, $\mathbf{z}_{i1}$, $(y_{it}, \mathbf{x}_{it})$ is independent of $s_{it}$:

$$P(s_{it} = 1 \mid y_{it}, \mathbf{x}_{it}, \mathbf{z}_{i1}) = P(s_{it} = 1 \mid \mathbf{z}_{i1}), \quad t = 2, \ldots, T. \tag{19.118}$$


As in the cross section case, assumption (19.118) has been called “selection on observables” in the econometrics literature and “ignorable selection” or “unconfoundedness” in the statistics literature. (The Heckman method we just covered is more like the “selection on unobservables” that we discussed in Section 19.8.) Condition (19.118) is a strong assumption in that it maintains that, in every time period, $\mathbf{z}_{i1}$ is such a good predictor of attrition that the distribution of $s_{it}$ given $\{\mathbf{z}_{i1}, (\mathbf{x}_{it}, y_{it})\}$ does not depend on $(\mathbf{x}_{it}, y_{it})$. If we have such an observed set of variables $\mathbf{z}_{i1}$, it is not too surprising that we can account for nonrandom attrition for a large class of estimation problems.

As in the cross section case, IPW estimation typically involves two steps. First, for each $t$, we estimate a probit or logit of $s_{it}$ on $\mathbf{z}_{i1}$. (A crucial point is that the same cross section units, namely, all units appearing in the first time period, are used in the probit or logit for each time period.) Let $\hat{p}_{it}$ be the fitted probabilities, $t = 2, \ldots, T$, $i = 1, \ldots, N$. In the second step, the objective function for $(i,t)$ is weighted by $1/\hat{p}_{it}$. For general M-estimation, the objective function is

$$\sum_{i=1}^{N}\sum_{t=1}^{T} (s_{it}/\hat{p}_{it})\, q_t(\mathbf{w}_{it}, \boldsymbol{\theta}), \tag{19.119}$$

where $\mathbf{w}_{it} = (y_{it}, \mathbf{x}_{it})$ and $q_t(\mathbf{w}_{it}, \boldsymbol{\theta})$ is the objective function in each time period. As usual, the selection indicator $s_{it}$ chooses the observations where we actually observe data. (For $t = 1$, $s_{it} = \hat{p}_{it} = 1$ for all $i$.) For least squares, $q_t(\mathbf{w}_{it}, \boldsymbol{\theta})$ is simply the squared residual function; for partial MLE, $q_t(\mathbf{w}_{it}, \boldsymbol{\theta})$ is the log-likelihood function.
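A small simulation can make the two steps concrete. The design below is entirely hypothetical: a single later period, a binary first-period covariate `z`, an outcome that depends on `z`, and attrition that depends only on `z`, so that selection on observables holds by construction. With one binary covariate, cell frequencies give the same fitted probabilities as a saturated probit or logit, which keeps the sketch free of any estimation library. The target is the population mean of `y` (equal to 2 in this design), which corresponds to the objective function $q(y, \theta) = (y - \theta)^2$.

```python
import random


def simulate_ipw(n=20000, seed=12345):
    rng = random.Random(seed)
    # Hypothetical retention probabilities P(s = 1 | z): units with z = 1
    # have high outcomes and are much more likely to leave the sample.
    retain_prob = {0: 0.9, 1: 0.3}
    data = []
    for _ in range(n):
        z = 1 if rng.random() < 0.5 else 0
        y = 1.0 + 2.0 * z + rng.gauss(0.0, 1.0)
        s = 1 if rng.random() < retain_prob[z] else 0
        data.append((z, y, s))

    # Step 1: fitted retention probabilities, using ALL first-period units.
    p_hat = {}
    for cell in (0, 1):
        in_cell = [s for z, _, s in data if z == cell]
        p_hat[cell] = sum(in_cell) / len(in_cell)

    # Step 2: weight each retained observation by 1 / p_hat. Minimizing the
    # weighted sum of squared deviations gives the weighted average below
    # (with cell-frequency p_hat, the weights s / p_hat sum exactly to n).
    ipw_mean = sum(s * y / p_hat[z] for z, y, s in data) / n

    # Comparison: the unweighted mean over retained units only.
    retained = [y for _, y, s in data if s == 1]
    unweighted_mean = sum(retained) / len(retained)
    return ipw_mean, unweighted_mean


ipw_mean, unweighted_mean = simulate_ipw()
print("IPW estimate:       ", ipw_mean)
print("unweighted estimate:", unweighted_mean)
```

In this design the unweighted mean over retained units converges to 1.5, not 2, because low-outcome units are overrepresented after attrition; the weighted estimator undoes that imbalance.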

The argument for why IPW works is similar to the pure cross section case. Let $\boldsymbol{\theta}_o$ denote the value of $\boldsymbol{\theta}$ that solves the population problem $\min_{\boldsymbol{\theta} \in \Theta} \sum_{t=1}^{T} E[q_t(\mathbf{w}_{it}, \boldsymbol{\theta})]$. Let $\boldsymbol{\delta}_t^o$ denote the true values of the selection response parameters in each time period, so that $P(s_{it} = 1 \mid \mathbf{z}_{i1}) = p_t(\mathbf{z}_{i1}, \boldsymbol{\delta}_t^o) \equiv p_{it}^o$. Now, under standard regularity conditions, we can replace $p_{it}^o$ with $\hat{p}_{it} \equiv p_t(\mathbf{z}_{i1}, \hat{\boldsymbol{\delta}}_t)$ without affecting the consistency argument. So, apart from regularity conditions, it is sufficient to show that $\boldsymbol{\theta}_o$ minimizes $\sum_{t=1}^{T} E[(s_{it}/p_{it}^o)\, q_t(\mathbf{w}_{it}, \boldsymbol{\theta})]$ over $\Theta$. But, from iterated expectations,

$$\begin{aligned} E[(s_{it}/p_{it}^o)\, q_t(\mathbf{w}_{it}, \boldsymbol{\theta})] &= E\{E[(s_{it}/p_{it}^o)\, q_t(\mathbf{w}_{it}, \boldsymbol{\theta}) \mid \mathbf{w}_{it}, \mathbf{z}_{i1}]\} \\ &= E\{[E(s_{it} \mid \mathbf{w}_{it}, \mathbf{z}_{i1})/p_{it}^o]\, q_t(\mathbf{w}_{it}, \boldsymbol{\theta})\} = E[q_t(\mathbf{w}_{it}, \boldsymbol{\theta})] \end{aligned}$$

because $E(s_{it} \mid \mathbf{w}_{it}, \mathbf{z}_{i1}) = P(s_{it} = 1 \mid \mathbf{z}_{i1})$ by assumption (19.118). Therefore, the probability limit of the weighted objective function is identical to that of the unweighted function if we had no attrition problem. Using this simple analogy argument and standard two-step estimation results from Chapter 12 shows that the inverse probability weighting produces a consistent, $\sqrt{N}$-asymptotically normal estimator. The methods for adjusting the asymptotic variance matrix of two-step M-estimators, described in Subsection 12.5.2, can be applied to the IPW M-estimator from equation (19.119). The panel bootstrap that accounts for the two estimation steps is also attractive.

MFG propose an IPW scheme where the conditioning variables in the attrition probits change across time. In particular, at time $t$ an attrition probit is estimated restricting attention to those units still in the sample at time $t-1$. (Out of this group, some are lost to attrition at time $t$, and some are not.) If we assume that attrition is an absorbing state, we can include in the conditioning variables, $\mathbf{z}_{it}$, all values of $y$ and $\mathbf{x}$ dated at time $t-1$ and earlier (as well as other variables observed for all units in the sample at $t-1$). This approach is appealing because the ignorability assumption is much more plausible if we can condition on both recent responses and covariates. (That is, $P(s_{it} = 1 \mid \mathbf{w}_{it}, \mathbf{w}_{i,t-1}, \ldots, \mathbf{w}_{i1}, s_{i,t-1} = 1) = P(s_{it} = 1 \mid \mathbf{w}_{i,t-1}, \ldots, \mathbf{w}_{i1}, s_{i,t-1} = 1)$ is more likely than assumption (19.64).) Unfortunately, obtaining the fitted probabilities in this way and using them in an IPW procedure does not generally produce consistent estimators. The problem is that the selection models at each time period are not representative of the population that was originally sampled at $t = 1$. Letting $p_{it}^o = P(s_{it} = 1 \mid \mathbf{w}_{i,t-1}, \ldots, \mathbf{w}_{i1}, s_{i,t-1} = 1)$, we can no longer use the iterated expectations argument to conclude that $E[(s_{it}/p_{it}^o)\, q_t(\mathbf{w}_{it}, \boldsymbol{\theta})] = E[q_t(\mathbf{w}_{it}, \boldsymbol{\theta})]$. Only if $E[q_t(\mathbf{w}_{it}, \boldsymbol{\theta})] = E[q_t(\mathbf{w}_{it}, \boldsymbol{\theta}) \mid s_{i,t-1} = 1]$ for all $\boldsymbol{\theta}$ does the argument work, but this assumption essentially requires that $\mathbf{w}_{it}$ be independent of $s_{i,t-1}$.

It is possible to allow the covariates in the selection probabilities to increase in 
richness over time, but the MFG procedure must be modified. For the case where 
attrition is an absorbing state, we can extend to the M-estimation case the nonlinear 
regression results of Robins, Rotnitzky, and Zhao (1995) (RRZ). It turns out that 
in some cases the probabilities to be used in IPW estimation can be constructed 
sequentially: 


$$p_{it}(\boldsymbol{\delta}_t^o) \equiv \pi_{i2}(\boldsymbol{\gamma}_2^o)\,\pi_{i3}(\boldsymbol{\gamma}_3^o)\cdots\pi_{it}(\boldsymbol{\gamma}_t^o), \quad t = 2, \ldots, T, \tag{19.120}$$

where

$$\pi_{it}(\boldsymbol{\gamma}_t^o) \equiv P(s_{it} = 1 \mid \mathbf{z}_{it}, s_{i,t-1} = 1). \tag{19.121}$$

In other words, as in the MFG procedure, we estimate probit models at each time $t$, restricted to units that are in the sample at $t-1$. The covariates in the probit are essentially everything we can observe for units in the sample at time $t-1$ that might affect attrition. For $t = 2, \ldots, T$, let $\hat{\pi}_{it}$ denote the fitted selection probabilities. Then we construct the probability weights as the product $\hat{p}_{it} \equiv \hat{\pi}_{i2}\hat{\pi}_{i3}\cdots\hat{\pi}_{it}$ and use the objective function (19.119). Naturally, this method only works under certain assumptions. The key ignorability of selection condition can be stated as

$$P(s_{it} = 1 \mid \mathbf{v}_{i1}, \ldots, \mathbf{v}_{iT}, s_{i,t-1} = 1) = P(s_{it} = 1 \mid \mathbf{z}_{it}, s_{i,t-1} = 1), \tag{19.122}$$

where $\mathbf{v}_{it} \equiv (\mathbf{w}_{it}, \mathbf{z}_{it})$. Now, we must include future values of $\mathbf{w}_{it}$ and $\mathbf{z}_{it}$ in the conditioning set on the left-hand side. Assumption (19.122) is fairly strong, but it does allow for attrition to be strongly related to past outcomes on $y$ and $\mathbf{x}$ (which can be included in $\mathbf{z}_{it}$).

It is easy to see how assumption (19.122) leads to equation (19.120). First, with 
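$\mathbf{v}_i \equiv (\mathbf{v}_{i1}, \mathbf{v}_{i2}, \ldots, \mathbf{v}_{iT})$, we can always write

$$\begin{aligned} P(s_{it} = 1 \mid \mathbf{v}_i) &= P(s_{it} = 1 \mid \mathbf{v}_i, s_{i,t-1} = 1)\cdots P(s_{i2} = 1 \mid \mathbf{v}_i, s_{i1} = 1)\,P(s_{i1} = 1 \mid \mathbf{v}_i) \\ &= P(s_{it} = 1 \mid \mathbf{v}_i, s_{i,t-1} = 1)\cdots P(s_{i2} = 1 \mid \mathbf{v}_i) \end{aligned} \tag{19.123}$$

because we assume $s_{i1} \equiv 1$. Under assumption (19.122), $P(s_{it} = 1 \mid \mathbf{v}_i, s_{i,t-1} = 1) = P(s_{it} = 1 \mid \mathbf{z}_{it}, s_{i,t-1} = 1)$ for $t \ge 2$, and so

$$p_{it}^o = P(s_{it} = 1 \mid \mathbf{v}_i) = P(s_{it} = 1 \mid \mathbf{z}_{it}, s_{i,t-1} = 1)\cdots P(s_{i2} = 1 \mid \mathbf{z}_{i2}),$$

which is exactly equation (19.120).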
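The sequential construction amounts to a running product. As a minimal sketch, assume the fitted conditional retention probabilities for one unit are already available as a list (the values below are made up, and the function name is illustrative); the cumulative probabilities and the implied IPW weights then follow directly.

```python
def cumulative_retention_probs(pi_hat):
    """Turn conditional probabilities for periods t = 2, 3, ... into
    cumulative probabilities for periods t = 1, 2, 3, ...

    The first period has probability 1 (everyone is observed); for later
    periods the cumulative probability is the product of all conditional
    probabilities up to and including that period.
    """
    probs = [1.0]
    for pi_t in pi_hat:
        probs.append(probs[-1] * pi_t)
    return probs


# Hypothetical fitted values for one unit, periods t = 2, 3, 4.
pi_hat_i = [0.9, 0.8, 0.5]
p_hat_i = cumulative_retention_probs(pi_hat_i)
weights_i = [1.0 / p for p in p_hat_i]
print(p_hat_i)
print(weights_i)
```

The weights grow over time for units that survive several rounds of likely attrition, which is how the retained observations stand in for similar units that left.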

As in the cross section case, we can use the “surprising” efficiency result described 
in Section 13.10.2 to show that, under assumption (19.122), estimating the attrition 
probabilities is actually more efficient than using the known probabilities (if we 
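could). The key is that we have fully specified $D(s_{i1}, s_{i2}, \ldots, s_{iT} \mid \mathbf{v}_i)$. Because $\{\mathbf{w}_{it}\colon t = 1, \ldots, T\}$ is contained in $\mathbf{v}_i$, the conditional independence assumption (13.66) (with an appropriate change in notation) holds. To realize the efficiency gain in the computed variance matrix estimator, we partial out from the weighted M-estimator score the score from the first-stage MLE. Define

$$\mathbf{k}(\mathbf{s}_i, \mathbf{z}_i, \mathbf{w}_i, \boldsymbol{\gamma}, \boldsymbol{\theta}) \equiv \sum_{t=1}^{T} [s_{it}/p_{it}(\boldsymbol{\delta}_t)]\,\mathbf{r}_t(\mathbf{w}_{it}, \boldsymbol{\theta}),$$

where $\mathbf{r}_t(\mathbf{w}_{it}, \boldsymbol{\theta}) \equiv \nabla_{\theta}\, q_t(\mathbf{w}_{it}, \boldsymbol{\theta})'$ is the $P \times 1$ score of the unweighted objective function. Further, let $\mathbf{d}_i(\boldsymbol{\delta})$ denote the score of the conditional log likelihood

$$\sum_{t=2}^{T} s_{i,t-1}\{s_{it}\log[\pi_t(\mathbf{z}_{it}, \boldsymbol{\gamma}_t)] + (1 - s_{it})\log[1 - \pi_t(\mathbf{z}_{it}, \boldsymbol{\gamma}_t)]\},$$

where $\boldsymbol{\delta}$ denotes the vector of $\boldsymbol{\gamma}_t$, $t = 2, \ldots, T$. Typically, in each time period we would estimate a logit or probit. Then, $\mathbf{d}_i(\boldsymbol{\delta})$ is just the long column vector of stacked scores for each logit or probit, selected at time $t$ so that we use only the units in the sample at $t-1$. That is,

$$\mathbf{d}_i(\boldsymbol{\delta})' = [s_{i1}\mathbf{g}_2(s_{i2}, \mathbf{z}_{i2}; \boldsymbol{\gamma}_2),\ s_{i2}\mathbf{g}_3(s_{i3}, \mathbf{z}_{i3}; \boldsymbol{\gamma}_3),\ \ldots,\ s_{i,T-1}\mathbf{g}_T(s_{iT}, \mathbf{z}_{iT}; \boldsymbol{\gamma}_T)],$$

where $\mathbf{g}_t(s_{it}, \mathbf{z}_{it}; \boldsymbol{\gamma}_t)$ is just the gradient (row vector) for the binary response model at time $t$, using units in the sample at $t-1$. Then, as in the cross section case, let

$$\hat{\mathbf{e}}_i \equiv \hat{\mathbf{k}}_i - \left(N^{-1}\sum_{i=1}^{N}\hat{\mathbf{k}}_i\hat{\mathbf{d}}_i'\right)\left(N^{-1}\sum_{i=1}^{N}\hat{\mathbf{d}}_i\hat{\mathbf{d}}_i'\right)^{-1}\hat{\mathbf{d}}_i$$

be the $P \times 1$ residuals from the multivariate regression of $\hat{\mathbf{k}}_i$ on $\hat{\mathbf{d}}_i$, $i = 1, \ldots, N$, where all hatted quantities are evaluated at $\hat{\boldsymbol{\delta}}$ or $\hat{\boldsymbol{\theta}}_w$ (where the latter is the IPW M-estimator). Further, define

$$\hat{\mathbf{A}} \equiv N^{-1}\sum_{i=1}^{N}\sum_{t=1}^{T} [s_{it}/p_{it}(\hat{\boldsymbol{\delta}}_t)]\,\mathbf{H}_t(\mathbf{w}_{it}, \hat{\boldsymbol{\theta}}_w).$$

The asymptotic variance of $\hat{\boldsymbol{\theta}}_w$ is consistently estimated as $\hat{\mathbf{A}}^{-1}\hat{\mathbf{D}}\hat{\mathbf{A}}^{-1}/N$, which allows for general serial dependence in the scores over time as well as violation of the information matrix equality in each time period. (So, for example, with nonlinear regression, or QMLE in the linear exponential family, the conditional variances may be arbitrary.)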
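The middle matrix in the sandwich is not written out in this passage. Assuming it is built as in the cross section case of Section 19.8, that is, as the sample average of the outer products of the partialled-out residuals, the pieces fit together as in the following sketch (the symbol $\widehat{\mathrm{Avar}}$ is shorthand introduced here):

$$\hat{\mathbf{D}} \equiv N^{-1}\sum_{i=1}^{N}\hat{\mathbf{e}}_i\hat{\mathbf{e}}_i', \qquad \widehat{\mathrm{Avar}}(\hat{\boldsymbol{\theta}}_w) = \hat{\mathbf{A}}^{-1}\hat{\mathbf{D}}\hat{\mathbf{A}}^{-1}/N.$$

Under that assumption the efficiency gain is visible in the formula itself: the residuals $\hat{\mathbf{e}}_i$ come from projecting $\hat{\mathbf{k}}_i$ on $\hat{\mathbf{d}}_i$, so their outer product is no larger, in the matrix sense, than the outer product of the $\hat{\mathbf{k}}_i$ that would appear if the first-stage estimation were ignored.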

An alternative, as usual with large $N$ and small $T$, is to include both steps of the two-step procedure in a panel bootstrapping routine. The obtained standard errors will properly reflect the increased efficiency from estimating the $\boldsymbol{\gamma}_t$.

RRZ (1995) show that, in the case of nonlinear regression where the explanatory 
variables are always observed—a leading case occurs when the covariates are 
observed in the first period and do not change over time—condition (19.122) can be 
relaxed somewhat, and the IPW method still produces consistent, asymptotically 
normal estimators. When modeling a feature of $D(y_{it} \mid \mathbf{x}_{it})$ when the covariates are not always observed, IPW suffers from the same problem we discussed in Section 19.8. Namely, if the feature of $D(y_{it} \mid \mathbf{x}_{it})$ is correctly specified (and the objective function has been properly chosen), and if selection is a function of the covariates in the sense that

$$P(s_{it} = 1 \mid \mathbf{x}_{it}, y_{it}, s_{i,t-1} = 1) = P(s_{it} = 1 \mid \mathbf{x}_{it}, s_{i,t-1} = 1),$$

then the unweighted pooled M-estimator is consistent. If $\mathbf{x}_{it}$ is not fully observed at time $t$, then it cannot be included in $\mathbf{z}_{it}$, and condition (19.122) is very likely to fail. As in the case of the Heckman approach applied to a first-differenced linear equation, the weighting is not innocuous: it would actually lead to inconsistent estimation. Thus, in the case of attrition in the leading case where interest lies in a feature of $D(y_{it} \mid \mathbf{x}_{it})$, and some elements of $\mathbf{x}_{it}$ are unobserved in time $t$ for those leaving the sample, one must be careful in interpreting important differences between the weighted and unweighted estimators: the problem might be with the weighted estimator. (Of course, both estimators might be inconsistent, too.)


Problems 


19.1. In the setup of Section 19.2.1, suppose that the costs, $r_i$, do not vary across $i$. What features of $D(y_i \mid \mathbf{x}_i)$ can we consistently estimate?


19.2. Some occupations, such as major league baseball, are characterized by salary 
floors. This situation can be described by 


$$y = \exp(\mathbf{x}\boldsymbol{\beta} + u), \qquad u \mid \mathbf{x} \sim \mathrm{Normal}(0, \sigma^2)$$

$$w = \max(f, y),$$

where $f > 0$ is the common salary floor (the minimum wage), $y$ is the person's true worth (productivity), and $\mathbf{x}$ contains human capital and demographic variables.

a. Find the log-likelihood function for a random draw $i$ from the population.

b. How would you estimate $E(y \mid \mathbf{x})$?

c. Is $E(w \mid \mathbf{x})$ of much interest in this application? Explain.

19.3. Let y be the percentage of annual income invested in a pension plan, and 
assume that current law caps this percentage at 10 percent. Therefore, in a sample of data, we observe $y_i$ between zero and 10, with pileups at the two corners, zero and 10.


a. What model would you use for y that recognizes the pileups at zero and 10? 


b. Explain the conceptual difference between the outcomes y= 0 and y= 10. In 
particular, which limit can be viewed as a form of data censoring? 

c. How would you estimate the partial effects on the expected contribution percent- 
age assuming that the current law always will be in effect? 


d. Suppose you want to ask, What would be the effect on $E(y \mid \mathbf{x})$, for any value of $\mathbf{x}$, if the contribution cap were increased from 10 to 11? How would you estimate the effect? [Hint: In the general two-limit Tobit model, call the upper bound, in general, $a_2$ (which is known, not a parameter to estimate), and then take the partial derivative with respect to $a_2$.]


e. If there are no observations at y = 10, what does the estimated model reduce to? 
Does this finding seem sensible? 


19.4. a. Suppose you are hired to explain fire damage to buildings in terms of 
building and neighborhood characteristics. If you use cross section data on reported 
fires, is there a sample selection problem due to the fact that most buildings do not 
catch fire during the year? 


b. If you want to estimate the relationship between contributions to a 401(k) plan 
and the match rate of the plan—the rate at which the employer matches employee 
contributions—is there a sample selection problem if you only use a sample of 
workers already enrolled in a 401(k) plan? 


19.5. In example 19.4, suppose that IQ is an indicator of abil, and KWW is another indicator (see Section 5.3.2). Find assumptions under which IV on the selected sample is valid.


19.6. Let $f(\cdot \mid \mathbf{x}_i; \boldsymbol{\theta})$ denote the density of $y_i$ given $\mathbf{x}_i$ for a random draw from the population. Find the conditional density of $y_i$ given $(\mathbf{x}_i, s_i = 1)$ when the selection rule is $s_i = 1[a_1(\mathbf{x}_i) < y_i < a_2(\mathbf{x}_i)]$, where $a_1(\mathbf{x})$ and $a_2(\mathbf{x})$ are known functions of $\mathbf{x}$. In the Hausman and Wise (1977) example, $a_2(\mathbf{x})$ was a function of family size because the poverty income level depends on family size.


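19.7. Suppose in Section 19.6.1 we replace assumption 19.1d with

$$E(u_1 \mid v_2) = \gamma_1 v_2 + \gamma_2(v_2^2 - 1).$$

(We subtract unity from $v_2^2$ to ensure that the second term has zero expectation.)

a. Using the fact that $\mathrm{Var}(v_2 \mid v_2 > -a) = 1 - \lambda(a)[\lambda(a) + a]$, show that

$$E(y_1 \mid \mathbf{x}, y_2 = 1) = \mathbf{x}_1\boldsymbol{\beta}_1 + \gamma_1\lambda(\mathbf{x}\boldsymbol{\delta}_2) - \gamma_2\lambda(\mathbf{x}\boldsymbol{\delta}_2)\,\mathbf{x}\boldsymbol{\delta}_2.$$

[Hint: Take $a = \mathbf{x}\boldsymbol{\delta}_2$ and use the fact that $E(v_2^2 \mid v_2 > -a) = \mathrm{Var}(v_2 \mid v_2 > -a) + [E(v_2 \mid v_2 > -a)]^2$.]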

b. Explain how to correct for sample selection in this case. 

c. How would you test for the presence of sample selection bias? 


19.8. Consider the following alternative to procedure 19.2 when $y_2$ is always observed. First, run the OLS regression of $y_2$ on $\mathbf{z}$ and obtain the fitted values, $\hat{y}_2$. Next, get the inverse Mills ratio, $\hat{\lambda}_3$, from the probit of $y_3$ on $\mathbf{z}$. Finally, run the OLS regression $y_1$ on $\mathbf{z}_1, \hat{y}_2, \hat{\lambda}_3$ using the selected sample.


a. Find a set of sufficient conditions that imply consistency of the proposed proce- 
dure. (Do not worry about regularity conditions.) 


b. Show that the assumptions from part a are more restrictive than those in proce- 
dure 19.2, and give some examples that are covered by procedure 19.2 but not by the 
alternative procedure. 


19.9. Apply procedure 19.4 to the data in MROZ.RAW. Use a constant, exper, and exper$^2$ as elements of $\mathbf{z}_1$; take $y_2$ = educ. The other elements of $\mathbf{z}$ should include age, kidslt6, kidsge6, nwifeinc, motheduc, fatheduc, and huseduc.


19.10. Consider the model 


$$y_1 = \mathbf{z}\boldsymbol{\delta}_1 + v_1$$

$$y_2 = \mathbf{z}\boldsymbol{\delta}_2 + v_2$$

$$y_3 = \max(0, \alpha_{31}y_1 + \alpha_{32}y_2 + \mathbf{z}_3\boldsymbol{\delta}_3 + u_3),$$

where $(\mathbf{z}, y_2, y_3)$ are always observed and $y_1$ is observed when $y_3 > 0$. The first two equations are reduced-form equations, and the third equation is of primary interest. For example, take $y_1 = \log(\mathit{wage}^o)$, $y_2 = \mathit{educ}$, and $y_3 = \mathit{hours}$, and then education and $\log(\mathit{wage}^o)$ are possibly endogenous in the labor supply function. Assume that $(v_1, v_2, u_3)$ are jointly zero-mean normal and independent of $\mathbf{z}$.

a. Find a simple way to consistently estimate the parameters in the third equation allowing for arbitrary correlations among $(v_1, v_2, u_3)$. Be sure to state any identification assumptions needed.

b. Now suppose that $y_2$ is observed only when $y_3 > 0$; for example, $y_1 = \log(\mathit{wage}^o)$, $y_2 = \log(\mathit{benefits}^o)$, $y_3 = \mathit{hours}$. Now derive a multistep procedure for estimating the third equation under the same assumptions as in part a.


c. How can we estimate the average partial effects? 


19.11. Consider the following conditional moment restrictions problem with a 
selected sample. In the population, $E[\mathbf{r}(\mathbf{w}, \boldsymbol{\theta}_o) \mid \mathbf{x}] = \mathbf{0}$. Let $s$ be the selection indicator, and assume that

$$E[\mathbf{r}(\mathbf{w}, \boldsymbol{\theta}_o) \mid \mathbf{x}, s] = \mathbf{0}.$$

Sufficient is that $s = f(\mathbf{x})$ for a nonrandom function $f$.

a. Let $\mathbf{Z}_i$ be a $G \times L$ matrix of functions of $\mathbf{x}_i$. Show that $\boldsymbol{\theta}_o$ satisfies

$$E[s_i\mathbf{Z}_i'\mathbf{r}(\mathbf{w}_i, \boldsymbol{\theta}_o)] = \mathbf{0}.$$

b. Write down the objective function for the system nonlinear 2SLS estimator based on the selected sample. Argue that, under the appropriate rank condition, the estimator is consistent and $\sqrt{N}$-asymptotically normal.

c. Write down the objective function for a minimum chi-square estimator using the selected sample. Use the estimates from part b to estimate the weighting matrix. Argue that the estimator is consistent and $\sqrt{N}$-asymptotically normal.


19.12. Consider model (19.56), where selection is ignorable in the sense that 
$E(u_1 \mid \mathbf{z}, v_3) = 0$. However, data are missing on $y_2$ when $y_3 = 0$, and $E(y_2 \mid \mathbf{z}, y_3) \ne E(y_2 \mid \mathbf{z})$.

a. Find $E(y_1 \mid \mathbf{z}, y_3)$.

b. If, in addition to assumption 19.2, $(v_2, v_3)$ is independent of $\mathbf{z}$ and $E(v_2 \mid v_3) = \gamma_2 v_3$, find $E(y_1 \mid \mathbf{z}, y_3 = 1)$.

c. Suggest a two-step method for consistently estimating $\boldsymbol{\delta}_1$ and $\alpha_1$.

d. Does this method generally work if $E(u_1 \mid \mathbf{z}, y_3) \ne 0$?

e. Would you bother with the method from part c if $E(u_1 \mid \mathbf{z}, y_2, y_3) = 0$?
Explain. 


19.13. In Section 17.6 we discussed two-part models for a corner solution out- 
come, say, y. These models have sometimes been studied in the context of incidental 
truncation. 

a. Suppose you have a parametric model for the distribution of y conditional on x 
and y > 0. (Cragg’s model and the lognormal model from Section 17.6 are exam- 
ples.) If you estimate the parameters of this model by conditional MLE, using only 
the observations for which y; > 0, do the parameter estimates suffer from sample 
selection bias? Explain. 


b. If instead you specify only $E(y \mid \mathbf{x}, y > 0) = \exp(\mathbf{x}\boldsymbol{\beta})$ and estimate $\boldsymbol{\beta}$ by nonlinear least squares using observations for which $y_i > 0$, do the estimates suffer from sample selection bias?

c. In addition to the specification from part b, suppose that $P(y = 0 \mid \mathbf{x}) = 1 - \Phi(\mathbf{x}\boldsymbol{\gamma})$. How would you estimate $\boldsymbol{\gamma}$?

d. Given the assumptions in parts b and c, how would you estimate $E(y \mid \mathbf{x})$?

e. Given your answers to the first four parts, do you think viewing estimation of two-part models as an incidental truncation problem is appropriate?


19.14. Consider equation (19.30) under the assumption $E(u \mid \mathbf{z}, s) = E(u \mid s) = (1 - s)\alpha_0 + s\alpha_1$. The first equality is the assumption; the second is unrestrictive, as it simply allows the mean of $u$ to differ in the selected and unselected subpopulations.

a. Show that 2SLS estimation using the selected subsample consistently estimates the slope parameters, $\beta_2, \ldots, \beta_K$. What is the plim of the intercept estimator? (Hint: Replace $u$ with $(1 - s)\alpha_0 + s\alpha_1 + e$, where $E(e \mid \mathbf{z}, s) = 0$.)

b. Show that $E(u \mid \mathbf{z}, s) = E(u \mid s)$ if $(u, s)$ is independent of $\mathbf{z}$. Does independence of $s$ and $\mathbf{z}$ seem reasonable?


19.15. Suppose that y given x follows a standard censored Tobit, where y is a cor- 
ner solution response. However, there is at least one element of x that we can observe 
only when y > 0. (An example is seen when y is quantity demanded of a good or 
service, and one element of x is price, derived as total expenditure on the good 
divided by y whenever y > 0.) 


a. Explain why we cannot use standard censored Tobit maximum likelihood estimation to estimate $\boldsymbol{\beta}$ and $\sigma^2$. What method can we use instead?


b. How is it that we can still estimate E(y |x), even though we do not observe some 
elements of x when y = 0? 


19.16. Consider a standard linear model with an endogenous explanatory variable: 
$$y_1 = \mathbf{z}_1\boldsymbol{\delta}_1 + \alpha_1 y_2 + u_1$$

$$y_2 = \mathbf{z}\boldsymbol{\delta}_2 + v_2,$$

but where $y_2$ is censored from above. Let $r_2$ be the censoring variable (which may be random) and define $w_2 = \min(r_2, y_2)$. Assume that $(u_1, v_2)$ is independent of $(\mathbf{z}, r_2)$ and $E(u_1 \mid v_2) = \rho_1 v_2$.

a. Show that

$$E(y_1 \mid \mathbf{z}, r_2, v_2) = \mathbf{z}_1\boldsymbol{\delta}_1 + \alpha_1 y_2 + \rho_1 v_2.$$

b. Define $s_2 = 1[y_2 < r_2]$, so $s_2 = 1$ means $y_2$ is observed. How come

$$E(y_1 \mid \mathbf{z}, r_2, v_2, s_2) = E(y_1 \mid \mathbf{z}, r_2, v_2) = \mathbf{z}_1\boldsymbol{\delta}_1 + \alpha_1 y_2 + \rho_1 v_2?$$

c. Propose a two-step estimator that consistently estimates $\boldsymbol{\delta}_1$ and $\alpha_1$ (along with $\rho_1$) where only the data with $y_2$ uncensored are used in the second step. How would you obtain valid standard errors?

19.17. Consider an unobserved effects panel data model with binary censoring:

$$y_{it} = \mathbf{x}_{it}\boldsymbol{\beta} + c_i + u_{it}$$

$$w_{it} = 1[y_{it} > r_{it}],$$

where $r_{it}$ is the known censoring value for unit $i$ in time $t$. Our goal is to estimate $\boldsymbol{\beta}$, which, in the absence of censoring, we could achieve by fixed effects, first difference, random effects (under stronger assumptions), or some variation on these.

a. Maintain that the censoring values are strictly exogenous conditional on $c_i$ (and $\mathbf{x}_i = (\mathbf{x}_{i1}, \ldots, \mathbf{x}_{iT})$) along with normality:

$$D(u_{it} \mid \mathbf{x}_i, \mathbf{r}_i, c_i) = \mathrm{Normal}(0, \sigma_u^2),$$

where $\mathbf{r}_i = (r_{i1}, r_{i2}, \ldots, r_{iT})$. Explain what kinds of censoring this assumption rules out.


b. Allow censoring to depend on $c_i$ using the Chamberlain-Mundlak device:

$$c_i = \psi + \bar{\mathbf{x}}_i\boldsymbol{\xi} + \omega\bar{r}_i + a_i$$

$$D(a_i \mid \mathbf{x}_i, \mathbf{r}_i) = \mathrm{Normal}(0, \sigma_a^2),$$

where, as usual, the overbar means time average. In other words, heterogeneity may be (partially) correlated with the average censoring value. Using this assumption with that in part a, show that $P(w_{it} = 1 \mid \mathbf{x}_i, \mathbf{r}_i)$ follows a probit model and obtain the coefficients on the various explanatory variables.


c. Explain how to consistently estimate $\boldsymbol{\beta}$ without further assumptions (other than that both $\{\mathbf{x}_{it}\}$ and $\{r_{it}\}$ have sufficient time variation).

d. What additional assumption would allow you to also estimate $\sigma_u^2$ and $\sigma_a^2$? Explain how you would use this assumption in estimation.


19.18. Suppose that in the population, $D(y \mid \mathbf{x})$ follows a Tobit$(\mathbf{x}\boldsymbol{\beta}, \sigma^2)$ distribution.
However, we observe y only when it is strictly positive. (Such a scenario is common 
for on-site sampling, where only individuals known to participate in an activity are 
sampled. An example is y is annual days spent visiting national parks, and surveys 
are taken at national parks.) 


a. How would you estimate $\boldsymbol{\beta}$ and $\sigma^2$ given a random sample from the subpopulation with $y > 0$?

b. How would you estimate partial effects of the $x_j$ on $E(y \mid \mathbf{x})$? How and why does this approach differ from the usual truncated regression setup?


c. Can you estimate a hurdle model of the kind covered in Section 17.6? Explain.

19.19. Consider a count random variable, $y_i$, with conditional probability density $f_y(\cdot \mid \mathbf{x}_i)$. Suppose that $r_i > 0$ is a right censoring point, and assume that $P(y_i = y \mid \mathbf{x}_i, r_i) = P(y_i = y \mid \mathbf{x}_i)$. For a random draw $i$ from the population, we observe the random variable $w_i = \min(y_i, r_i)$.

a. Let $F_y(\cdot \mid \mathbf{x})$ denote the conditional cdf of $y_i$ given $\mathbf{x}_i = \mathbf{x}$. Find the density of $w_i$ given $(\mathbf{x}_i, r_i)$ for $w = 0, 1, \ldots, r_i$.

b. Find the density from part a if $y_i$ has a conditional Poisson distribution with mean $\exp(\mathbf{x}_i\boldsymbol{\beta})$.

c. If you estimate $\boldsymbol{\beta}$ by maximum likelihood using the right censored data, is the resulting estimator generally robust to failure of the Poisson distribution? Explain.

d. Use your answer to part a to discuss the costs of data censoring for a count variable.


e. What other underlying population distribution might you use, and why? 


20 Stratified Sampling and Cluster Sampling 


20.1 Introduction 


In this chapter we study estimation when the data have been obtained by means of 
two common nonrandom sampling schemes. Stratified sampling occurs when units in 
a population are sampled with probabilities that do not reflect their frequency in the 
population. For example, in obtaining a data set on families, low-income families 
might be oversampled and high-income families undersampled. There are various 
mechanisms by which stratified samples are obtained, and we will cover the most 
common ones in this chapter. 

The case of truncated sampling covered in Section 19.7 can be viewed as an 
extreme case of stratified sampling, where part of the population is not sampled at all. 
For the most part, in this chapter we focus on the case where the entire population is 
sampled (but where the sampling frequencies differ from the population frequencies). 
As we will see, in this setup simple weighting methods are available for recovering the 
underlying population parameters. 
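To give a feel for the weighting idea before the formal treatment, here is a deliberately simple sketch for estimating a population mean from a stratified sample. It assumes the population share of each stratum is known and weights each observation by the ratio of the population share to the sample share of its stratum. The data, stratum labels, and function names are hypothetical.

```python
def stratum_weights(pop_shares, sample_strata):
    """Weight for each stratum: population share divided by sample share."""
    n = len(sample_strata)
    weights = {}
    for stratum, pop_share in pop_shares.items():
        sample_share = sum(1 for g in sample_strata if g == stratum) / n
        weights[stratum] = pop_share / sample_share
    return weights


def weighted_mean(y, sample_strata, pop_shares):
    """Weighted sample mean that targets the population mean."""
    w = stratum_weights(pop_shares, sample_strata)
    total_weight = sum(w[g] for g in sample_strata)
    return sum(w[g] * yi for yi, g in zip(y, sample_strata)) / total_weight


# Hypothetical example: low-income families are 20% of the population
# but were oversampled so that they make up half of the sample.
pop_shares = {"low": 0.2, "high": 0.8}
strata = ["low", "low", "low", "high", "high", "high"]
income = [10.0, 12.0, 14.0, 50.0, 60.0, 70.0]

unweighted = sum(income) / len(income)
weighted = weighted_mean(income, strata, pop_shares)
print("unweighted mean:", unweighted)
print("weighted mean:  ", weighted)
```

The unweighted mean treats the oversampled low-income stratum as if it were half the population; the weighted mean restores the population shares, giving 0.2 times the low-stratum average plus 0.8 times the high-stratum average.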

Cluster sampling refers to cases where clusters or groups, rather than individual 
units, are drawn from a population. For example, in evaluating the impact of edu- 
cational policies on test performance of fourth-graders in Michigan, one might sam- 
ple classrooms from the entire state (as opposed to randomly drawing fourth-graders 
from the population of all fourth-graders in Michigan). The classrooms constitute the 
clusters, and the students within the classrooms are the individual units. The cluster
sampling scheme generally implies that the outcomes of units within a cluster are
correlated through unobserved “cluster effects.” (In addition, some covariates, such
as quality of the teacher, will be perfectly correlated because students in the same 
class have the same teacher. Other covariates, such as family income, are likely to 
have substantial correlation but would vary within a classroom.) 
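The practical consequence of within-cluster correlation can be seen in the simplest possible case, the standard error of a sample mean. The sketch below is an illustration under simplifying assumptions (no degrees-of-freedom corrections, made-up test scores, illustrative function name): the naive variance treats every student as an independent draw, while the cluster-robust version first sums the deviations within each classroom and only then squares them.

```python
def mean_with_cluster_se(clusters):
    """clusters: list of lists of outcomes, one inner list per cluster.

    Returns (mean, naive_se, cluster_robust_se) for the pooled sample mean.
    """
    ys = [y for c in clusters for y in c]
    n = len(ys)
    ybar = sum(ys) / n
    # Naive: treats all n observations as independent.
    naive_var = sum((y - ybar) ** 2 for y in ys) / (n * n)
    # Cluster-robust: sum deviations within each cluster, then square.
    cluster_var = sum(sum(y - ybar for y in c) ** 2 for c in clusters) / (n * n)
    return ybar, naive_var ** 0.5, cluster_var ** 0.5


# Hypothetical test scores: three classrooms of three students each, with
# outcomes perfectly correlated within a classroom (an extreme case).
classrooms = [
    [1.0, 1.0, 1.0],
    [5.0, 5.0, 5.0],
    [9.0, 9.0, 9.0],
]
ybar, naive_se, cluster_se = mean_with_cluster_se(classrooms)
print("mean:", ybar)
print("naive se:", naive_se, "cluster-robust se:", cluster_se)
```

With perfect within-classroom correlation and three students per classroom, the cluster-robust variance is exactly three times the naive one: the sample effectively contains three independent observations, not nine.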

When a cluster sample is obtained by randomly drawing clusters from a large 
population of clusters, the resulting data set has features in common with the panel 
data sets we have studied throughout the book. Namely, we have many clusters that 
can be assumed to be independent of each other, but observations within a cluster are 
correlated. In, say, a firm-level panel data set, the firm plays the role of a cluster and 
time plays the role of the individual units. Because of their statistical similarity to 
“large N, small T” panel data sets, most of the statistical methods applied to cluster 
samples are familiar from our earlier analysis. We treat cluster samples separately 
because the nature of the within-cluster correlation is generally different from time 
series correlation, and cluster samples are naturally unbalanced even when there is no 
sample selection problem. (For example, in the population of fourth-grade classrooms in Michigan, we expect some variation in class size.)

20.2 Stratified Sampling


We begin with an analysis of stratified samples where, as mentioned in the introduc- 
tion, different subsets of the population are sampled with different frequencies. 
Obtaining samples that are not representative of the underlying population is often 
done intentionally in obtaining survey data. Some surveys are designed primarily to 
learn about a particular subset of the population (perhaps based on income, educa- 
tion, age, or race). That group is typically overrepresented in the sample compared 
with its frequency in the population. 

Stratification can be based on exogenous variables or endogenous variables (which 
are known once a model and assumptions have been specified) or some combination 
of these. As in the case of the sample selection problems we discussed in Chapter 19, 
it is important to know which is the case. 

We cover the two most common types of stratified sampling in this section (and 
touch on a third). In Section 20.2.1 we study standard stratified sampling, which 
involves stratifying the population and then drawing random samples from the dif- 
ferent strata. A different sampling scheme, variable probability sampling, is based on 
randomly drawing units from a population but then keeping the observations with 
unequal probabilities. 

The section does not provide a detailed treatment of choice-based sampling, which 
occurs in discrete response models when the stratification is based entirely on the 
response variable. Various methods have been proposed for estimating discrete 
response models with choice-based samples under different assumptions. Manski and 
McFadden (1981) and Cosslett (1993) contain general treatments. For a class of 
discrete response models, Cosslett (1981) proposed an efficient estimation method 
with choice-based sampling, and Imbens (1992) obtained a computationally simple 
method-of-moments estimator that also achieves the efficiency bound. Imbens and 
Lancaster (1996) allow for general response variables in a maximum likelihood set- 
ting. In this section, we focus on a convenient weighted-estimation approach that 
applies to a variety of estimation methods. Not surprisingly, when applied in maxi- 
mum likelihood contexts, weighted estimators are generally inefficient. 


20.2.1 Standard Stratified Sampling and Variable Probability Sampling 


The two most common stratification schemes used in obtaining data sets in the social 
sciences are standard stratified sampling (SS sampling) and variable probability sampling
(VP sampling). In SS sampling, the population is first partitioned into $J$ groups,
$\mathcal{W}_1, \mathcal{W}_2, \ldots, \mathcal{W}_J$, which we assume are nonoverlapping and exhaustive. We let $w$
denote the random vector representing the population of interest.


STANDARD STRATIFIED SAMPLING: For $j = 1, \ldots, J$, draw a random sample of size $N_j$
from stratum $j$. For each $j$, denote this random sample by $\{w_{ij}: i = 1, 2, \ldots, N_j\}$.


The strata sample sizes $N_j$ are nonrandom. Therefore, the total sample size, $N =
N_1 + \cdots + N_J$, is also nonrandom. A randomly drawn observation from stratum
$j$, $w_{ij}$, has distribution $D(w \mid w \in \mathcal{W}_j)$. Therefore, while observations within a stratum
are identically distributed, observations across strata are not. A scheme that is similar
in nature to SS sampling is called multinomial sampling, where a stratum is first
picked at random and then an observation is randomly drawn from the stratum. This
does result in i.i.d. observations, but it does not correspond to how stratified samples
are obtained in practice. It also leads to the same estimators as under SS sampling, so 
we do not discuss it further; see Cosslett (1993) or Wooldridge (1999b) for further 
discussion. 

Variable probability samples are obtained using a different scheme. First, an observation
is drawn at random from the population. If the observation falls into stratum $j$,
it is kept with probability $p_j$. Thus, random draws from the population are discarded
with varying frequencies depending on which stratum they fall into. This kind of 
sampling is appropriate when information on the variable or variables that determine 
the strata is relatively easy to obtain compared with the rest of the information. 
Survey data sets, including initial interviews to collect panel or longitudinal data, are 
good examples. Suppose we want to oversample individuals from, say, lower income 
classes. We can first ask an individual her or his income. If the response is in income 
class $j$, this person is kept in the sample with probability $p_j$, and then the remaining
information, such as education, work history, family background, and so on can be 
collected; otherwise, the person is dropped without further interviewing. 

A key feature of VP sampling is that observations within a stratum are discarded 
randomly. As discussed by Wooldridge (1999b), VP sampling is equivalent to the 
following: 


VARIABLE PROBABILITY SAMPLING: Repeat the following steps N times: 


1. Draw an observation $w_i$ at random from the population.
2. If $w_i$ is in stratum $j$, toss a (biased) coin with probability $p_j$ of turning up heads.
Let $h_{ij} = 1$ if the coin turns up heads and zero otherwise.
3. Keep observation $i$ if $h_{ij} = 1$; otherwise, omit it from the sample.


The number of observations falling into stratum $j$ is denoted $N_j$, and the number of
data points we actually have for estimation is $N_0 = N_1 + N_2 + \cdots + N_J$. Notice that
if $N$ (the number of times the population is sampled) is fixed, then $N_0$ is a random
variable: we do not know what each $N_j$ will be prior to sampling.

The assumption that the probability of the coin turning up heads in step 2 depends
only on the stratum ensures that sampling is random within each stratum. This
roughly reflects how samples are obtained for certain large cross-sectional and panel
data sets used in economics, including the Panel Study of Income Dynamics and the
National Longitudinal Survey.

To see that a VP sample can be analyzed as a random sample, we construct a
population that incorporates the stratification. The VP sampling scheme is equivalent
to first tossing all $J$ coins before actually observing which stratum $w_i$ falls into; this
gives $(h_{i1}, \ldots, h_{iJ})$. Next, $w_i$ is observed to fall into one of the strata. Finally, the
outcome is kept or not depending on the coin flip for that stratum. The result is that
the vector $(w_i, h_i)$, where $h_i$ is the $J$-vector of binary indicators $h_{ij}$, is a random sample
from a new population with sample space $\mathcal{W} \times \mathcal{H}$, where $\mathcal{W}$ is the original
sample space and $\mathcal{H}$ denotes the sample space associated with outcomes from flipping
$J$ coins. Under this alternative way of viewing the sampling scheme, $h_i$ is independent
of $w_i$. Treating $(w_i, h_i)$ as a random draw from the new population is not at
odds with the fact that our estimators are based on a nonrandom sample from the
original population: we simply use the vector $h_i$ to determine which observations are
kept in the estimation procedure.
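
The three steps of the VP scheme can be sketched as a short simulation. This is only an illustration under assumed inputs: the income population, the cutoff of 30, the retention probabilities 0.9 and 0.3, and the names `vp_sample` and `income_class` are all hypothetical and do not come from any survey.

```python
import random


def vp_sample(population, stratum_of, keep_prob, n_draws, seed=0):
    """Simulate VP sampling: draw, toss the stratum's coin, keep if heads.

    population : list of units to draw from (with replacement)
    stratum_of : function mapping a unit w_i to its stratum label j
    keep_prob  : dict {j: p_j} of retention probabilities
    n_draws    : N, the (fixed) number of times the population is sampled
    """
    rng = random.Random(seed)
    kept = []
    drawn = {j: 0 for j in keep_prob}       # times each stratum was drawn
    retained = {j: 0 for j in keep_prob}    # observations actually kept
    for _ in range(n_draws):
        w = rng.choice(population)          # step 1: random draw w_i
        j = stratum_of(w)
        drawn[j] += 1
        heads = rng.random() < keep_prob[j]  # step 2: biased coin, h_ij
        if heads:                           # step 3: keep only if h_ij = 1
            kept.append(w)
            retained[j] += 1
    return kept, drawn, retained


def income_class(w):
    # Hypothetical rule: incomes of 30 or less form the "low" stratum.
    return "low" if w <= 30 else "high"


incomes = list(range(1, 101))               # hypothetical population
p = {"low": 0.9, "high": 0.3}               # oversample the low stratum

sample, drawn, retained = vp_sample(incomes, income_class, p, n_draws=20000)
n0 = len(sample)                            # N_0 is random although N is fixed
share_low = retained["low"] / n0            # sample share of the low stratum
```

In this hypothetical design the low stratum is 30 percent of the population but should make up roughly 0.27/0.48, or about 56 percent, of the retained sample, which is the sense in which a VP sample is not representative of the population.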


20.2.2 Weighted Estimators to Account for Stratification 


With variable probability sampling, it is easy to construct weighted objective functions
that produce consistent and asymptotically normal estimators of the population
parameters. Initially, it is useful to define a set of binary variables that indicate
whether a random draw, $w_i$, is kept in the sample and, if so, which stratum it falls
into. Let $z_{ij} = 1[w_i \in \mathcal{W}_j]$, $j = 1, \ldots, J$, be the binary strata indicators, that is, $z_{ij} = 1$ if
and only if $w_i \in \mathcal{W}_j$. The vector of strata indicators is $z_i = (z_{i1}, \ldots, z_{iJ})$. Then define

$$r_{ij} = h_{ij} z_{ij}, \qquad j = 1, \ldots, J. \tag{20.1}$$

By definition, $r_{ij} = 1$ for at most one $j$. If $h_{ij} = 1$ then $r_{ij} = z_{ij}$, so that $r_{ij}$ is the same
as the stratum indicator. If $r_{ij} = 0$ for all $j = 1, 2, \ldots, J$, then the random draw $w_i$
does not appear in the sample (and we probably do not know which stratum the
observation fell into).

With these definitions, we can define the weighted M-estimator, $\hat\theta_w$, as the solution to

$$\min_{\theta \in \Theta} \sum_{i=1}^{N} \sum_{j=1}^{J} p_j^{-1} r_{ij}\, q(w_i, \theta), \tag{20.2}$$

where $q(w, \theta)$ is the objective function that is chosen to identify the population
parameters $\theta_o$. Note how the outer summation is over all potential observations, that
is, the observations that would appear in a random sample. The indicators $r_{ij}$ simply
pick out the observations that actually appear in the available sample, and these
indicators also attach each observed data point to its stratum. The objective function
(20.2) weights each observed data point in the sample by the inverse of the sampling
probability. For implementation it is useful to write the objective function as

$$\min_{\theta \in \Theta} \sum_{i=1}^{N_0} p_{j_i}^{-1} q(w_i, \theta), \tag{20.3}$$

where, without loss of generality, the data points actually observed are ordered $i =
1, \ldots, N_0$. Since $j_i$ is the stratum for observation $i$, $p_{j_i}^{-1}$ is the weight attached to observation
$i$ in the estimation. In practice, the $p_{j_i}^{-1}$ are the sampling weights reported
with other variables in VP stratified samples.

The objective function $q(w, \theta)$ covers all of the M-estimator examples we have
covered so far in the book, including least squares (linear and nonlinear), conditional 
maximum likelihood, and partial maximum likelihood. In panel data applications, 
the probability weights are from sampling in an initial year. Weights for later years 
are intended to reflect both stratification (if any) and possible attrition, as discussed in 
Section 19.9.3. 

In the case of estimating the mean from a population, the resulting weighted M-estimator
has a familiar form. Let $\mu_o = E(w)$ be the population mean. Then the
weighted M-estimator solves

$$\min_{\mu} \sum_{i=1}^{N_0} p_{j_i}^{-1} \|w_i - \mu\|^2, \tag{20.4}$$

and the solution is easily seen to be the weighted average

$$\hat\mu_w = \sum_{i=1}^{N_0} v_{j_i} w_i, \tag{20.5}$$

where

$$v_{j_i} = p_{j_i}^{-1} \Big/ \left( \sum_{h=1}^{N_0} p_{j_h}^{-1} \right). \tag{20.6}$$
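
A small numerical check of (20.5) and (20.6) may help. The sketch below assumes a hypothetical population in which 30 percent of units belong to a low stratum with value 10 and 70 percent to a high stratum with value 50, so the population mean is 38. With retention probabilities 0.9 and 0.3, a VP sample would contain the two strata in roughly a 27-to-21 ratio, which is what the made-up data mimic; the function name `ipw_mean` is ours.

```python
def ipw_mean(values, keep_probs):
    """Weighted average (20.5) using the normalized weights in (20.6)."""
    inv = [1.0 / p for p in keep_probs]          # p_{j_i}^{-1}
    total = sum(inv)
    weights = [a / total for a in inv]           # v_{j_i}, which sum to one
    return sum(wt * val for wt, val in zip(weights, values))


# Hypothetical retained sample: 27 low-stratum and 21 high-stratum observations.
values = [10.0] * 27 + [50.0] * 21
keep_probs = [0.9] * 27 + [0.3] * 21

unweighted = sum(values) / len(values)           # ignores the oversampling
weighted = ipw_mean(values, keep_probs)          # undoes the oversampling
```

The unweighted average is pulled toward the oversampled low stratum, while the weighted average returns the assumed population mean.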


In the general case, Wooldridge (1999b) shows that, under the same assumptions
as Theorem 12.2 and the assumption that each sampling probability is positive,
the weighted M-estimator consistently estimates $\theta_o$, which is assumed to uniquely
minimize $E[q(w, \theta)]$. Actually, as shown by Wooldridge (2007), consistency follows
from our treatment of inverse-probability-weighted M-estimation in Section 19.8. As
noted earlier, the vector $z_i$ is the $J$-vector of strata indicators, $z_{ij} = 1[w_i \in \mathcal{W}_j]$. Under
VP sampling, the sampling probability depends only on the stratum, so Assumption
19.6 holds by design. In particular, define $s_i = h_{i1} z_{i1} + \cdots + h_{iJ} z_{iJ}$ to be the selection
indicator. Then

$$P(s_i = 1 \mid z_i, w_i) = P(s_i = 1 \mid z_i) = p_1 z_{i1} + p_2 z_{i2} + \cdots + p_J z_{iJ}. \tag{20.7}$$

Therefore, we can use the consistency result for IPW M-estimation directly to establish
the consistency of the IPW estimator for VP sampling.

Asymptotic normality also follows under the same regularity conditions as in
Chapter 12. Wooldridge (1999b) shows that a valid estimator of the asymptotic
variance of $\hat\theta_w$ is

$$\left[ \sum_{i=1}^{N_0} p_{j_i}^{-1} \nabla_\theta^2 q_i(\hat\theta_w) \right]^{-1} \left[ \sum_{i=1}^{N_0} p_{j_i}^{-2} \nabla_\theta q_i(\hat\theta_w)' \nabla_\theta q_i(\hat\theta_w) \right] \left[ \sum_{i=1}^{N_0} p_{j_i}^{-1} \nabla_\theta^2 q_i(\hat\theta_w) \right]^{-1}, \tag{20.8}$$

which looks like the standard formula for a robust variance matrix estimator except
for the presence of the sampling probabilities $p_{j_i}$.

When $w$ partitions as $(x, y)$, an alternative estimator replaces the Hessian $\nabla_\theta^2 q_i(\hat\theta_w)$
in expression (20.8) with $A(x_i, \hat\theta_w)$, where $A(x_i, \theta_o) \equiv E[\nabla_\theta^2 q(w_i, \theta_o) \mid x_i]$, as in Chapter
12. Asymptotic standard errors and Wald statistics can be obtained using either
estimate of the asymptotic variance.

We can also apply the “surprising” efficiency result concerning estimation of the
sampling probabilities to VP stratification, at least if additional information is kept
during the sampling. For a random draw $i$, the log likelihood for the density of $s_i$
given $z_i$ can be written as

$$\sum_{j=1}^{J} z_{ij} [s_i \log(p_j) + (1 - s_i) \log(1 - p_j)]. \tag{20.9}$$

For each $j = 1, \ldots, J$, the maximum likelihood estimate, $\hat p_j$, is easily seen to be the
fraction of observations retained out of all of those originally drawn from
stratum $j$: $\hat p_j = M_j/N_j$, where $M_j = \sum_{i=1}^{N} z_{ij} s_i$ and $N_j = \sum_{i=1}^{N} z_{ij}$. In other words, $M_j$
is the number of retained data points from stratum $j$, and $N_j$ is the number of times
stratum $j$ was drawn in the VP sampling scheme. If the $N_j$, $j = 1, \ldots, J$, are reported
along with the VP sample, then we can easily obtain the $\hat p_j$ (because the $M_j$ are
always known). We do not need to observe the specific strata indicators for observations
for which $s_i = 0$. (If the stratification is exogenous, as defined in Section 20.2.3,
then it does not matter whether we use the estimated or known sampling probabilities:
the asymptotic variance is unchanged in that case.)
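
As a rough illustration of the estimate $\hat p_j = M_j/N_j$, the sketch below works directly from stratum counts. The counts are invented for the example, and the function names are ours. The log likelihood is (20.9) summed over all $N$ draws and rewritten in terms of the counts, which is possible because (20.9) depends on a draw only through its stratum and whether it was kept.

```python
import math


def estimate_keep_probs(retained, drawn):
    """MLE p_j-hat = M_j / N_j from counts of retained and drawn units."""
    return {j: retained[j] / drawn[j] for j in drawn}


def log_likelihood(p, retained, drawn):
    """Expression (20.9) summed over all draws, written with the counts."""
    return sum(
        retained[j] * math.log(p[j])
        + (drawn[j] - retained[j]) * math.log(1.0 - p[j])
        for j in drawn
    )


# Hypothetical counts reported along with a VP sample.
drawn = {"low": 400, "high": 600}        # N_j: times stratum j was drawn
retained = {"low": 360, "high": 150}     # M_j: observations kept from stratum j

p_hat = estimate_keep_probs(retained, drawn)
ll_hat = log_likelihood(p_hat, retained, drawn)
ll_other = log_likelihood({"low": 0.85, "high": 0.30}, retained, drawn)
```

Any other candidate probabilities, such as the arbitrary pair used for `ll_other`, give a lower log likelihood than the retained fractions.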


Example 20.1 (Linear Model under Stratified Sampling): In estimating the linear
model

$$y = x\beta_o + u, \qquad E(x'u) = 0 \tag{20.10}$$

by IPW least squares, the asymptotic variance matrix estimator is

$$\left( \sum_{i=1}^{N_0} p_{j_i}^{-1} x_i' x_i \right)^{-1} \left( \sum_{i=1}^{N_0} p_{j_i}^{-2} \hat u_i^2 x_i' x_i \right) \left( \sum_{i=1}^{N_0} p_{j_i}^{-1} x_i' x_i \right)^{-1}, \tag{20.11}$$

where $\hat u_i = y_i - x_i \hat\beta_w$ is the residual after WLS estimation. Interestingly, this is simply
the White (1980b) heteroskedasticity-consistent covariance matrix estimator applied
to the stratified sample, where all variables for observation $i$ are weighted by $p_{j_i}^{-1/2}$
before performing the regression. This estimator has been suggested by, among others,
Hausman and Wise (1981). Hausman and Wise use maximum likelihood to obtain
more efficient estimators in the context of the normal linear regression model, that is,
$y \mid x \sim \mathrm{Normal}(x\beta_o, \sigma_o^2)$. Because of stratification, MLE is not generally robust to
failure of the homoskedastic normality assumption.

It is important to remember that the form of expression (20.11) in this example is 
not due to potential heteroskedasticity in the underlying population model. Even if 
$E(u^2 \mid x) = \sigma_o^2$, the estimator (20.11) is generally needed because of the stratified
sampling. This estimator also works in the presence of heteroskedasticity of arbitrary 
and unknown form in the population, and it is routinely computed by many regres- 
sion packages. 
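
The equivalence between (20.11) and White's estimator on reweighted data can be checked numerically. The following is a minimal sketch using NumPy on simulated data: the two-stratum design, the retention probabilities, and the coefficient values are assumptions made only for the illustration, and `ipw_ols` is our own helper, not a library routine.

```python
import numpy as np


def ipw_ols(X, y, keep_probs):
    """IPW least squares with the sandwich variance estimator in (20.11)."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    w = 1.0 / np.asarray(keep_probs, dtype=float)   # p_{j_i}^{-1}
    A = X.T @ (w[:, None] * X)                      # sum of p^{-1} x_i' x_i
    beta = np.linalg.solve(A, X.T @ (w * y))
    u = y - X @ beta                                # residuals u_i-hat
    B = X.T @ ((w**2 * u**2)[:, None] * X)          # sum of p^{-2} u^2 x_i' x_i
    A_inv = np.linalg.inv(A)
    return beta, A_inv @ B @ A_inv


rng = np.random.default_rng(0)
n = 200
stratum = rng.integers(0, 2, size=n)                # hypothetical two strata
p = np.where(stratum == 0, 0.9, 0.3)                # assumed retention probs
X = np.column_stack([np.ones(n), rng.normal(size=n)])
y = X @ np.array([1.0, 2.0]) + rng.normal(size=n)

beta, avar = ipw_ols(X, y, p)

# White's estimator after scaling every variable by p^{-1/2}.
s = np.sqrt(1.0 / p)
Xt, yt = X * s[:, None], y * s
bt = np.linalg.lstsq(Xt, yt, rcond=None)[0]
ut = yt - Xt @ bt
XtX_inv = np.linalg.inv(Xt.T @ Xt)
white = XtX_inv @ (Xt.T @ (ut[:, None] ** 2 * Xt)) @ XtX_inv
```

Both routes give the same coefficients and the same variance matrix, which is why a regression package's heteroskedasticity-robust option applied to the weighted regression delivers (20.11).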


Example 20.2 (Conditional MLE under Stratified Sampling): When $f(y \mid x; \theta)$ is a
correctly specified model for the density of $y_i$ given $x_i$ in the population, the inverse-probability-weighted
MLE is obtained with $q_i(\theta) \equiv -\log[f(y_i \mid x_i; \theta)]$. This estimator
is consistent and asymptotically normal, with asymptotic variance estimator given by
expression (20.8) [or the form that uses $A(x_i, \hat\theta_w)$].


A weighting scheme is also available in the standard stratified sampling case, but
the weights are different from the VP sampling case. To derive them, let $Q_j =
P(w \in \mathcal{W}_j)$ denote the population frequency for stratum $j$; we assume that the $Q_j$ are
known. By the law of iterated expectations,

$$E[q(w, \theta)] = Q_1 E[q(w, \theta) \mid w \in \mathcal{W}_1] + \cdots + Q_J E[q(w, \theta) \mid w \in \mathcal{W}_J] \tag{20.12}$$

for any $\theta$. For each $j$, $E[q(w, \theta) \mid w \in \mathcal{W}_j]$ can be consistently estimated using a random
sample obtained from stratum $j$. This scheme leads to the sample objective
function

$$Q_1 \left[ N_1^{-1} \sum_{i=1}^{N_1} q(w_{i1}, \theta) \right] + \cdots + Q_J \left[ N_J^{-1} \sum_{i=1}^{N_J} q(w_{iJ}, \theta) \right],$$

where $w_{ij}$ denotes a random draw $i$ from stratum $j$ and $N_j$ is the nonrandom sample
size for stratum $j$. We can apply the uniform law of large numbers to each term, so
that the sum converges uniformly to equation (20.12) under the regularity conditions
in Chapter 12. By multiplying and dividing each term by the total number of observations
$N = N_1 + \cdots + N_J$, we can write the sample objective function more simply as

$$N^{-1} \sum_{i=1}^{N} (Q_{j_i}/H_{j_i})\, q(w_i, \theta), \tag{20.13}$$


where $j_i$ denotes the stratum for observation $i$ and $H_j = N_j/N$ denotes the fraction of
observations in stratum $j$. Because we have the stratum indicator $j_i$, we can drop the
$j$ subscript on $w_{ij}$. When we omit the division by $N$, equation (20.13) has the same
form as equation (20.3), but the weights are $(Q_{j_i}/H_{j_i})$ rather than $p_{j_i}^{-1}$ (and the arguments
for why each weighting works are very different). Also, in general, the formula
for the asymptotic variance is different in the SS sampling case. In addition to the
minor notational change of replacing $N_0$ with $N$, the middle matrix in equation (20.8)
becomes

$$\sum_{j=1}^{J} (Q_j^2/H_j^2) \left[ \sum_{i=1}^{N_j} (\nabla_\theta \hat q_{ij} - \overline{\nabla_\theta q}_j)' (\nabla_\theta \hat q_{ij} - \overline{\nabla_\theta q}_j) \right], \tag{20.14}$$

where $\nabla_\theta \hat q_{ij} = \nabla_\theta q(w_{ij}, \hat\theta_w)$ and $\overline{\nabla_\theta q}_j = N_j^{-1} \sum_{i=1}^{N_j} \nabla_\theta \hat q_{ij}$ (the within-stratum sample
average). This approach requires us to explicitly partition observations into their
respective strata. See Wooldridge (2001) for a detailed derivation. [If in the VP
sampling case the population frequencies $Q_j$ are known, it is better to use as weights
$Q_j/(N_j/N_0)$ rather than $p_j^{-1}$, which makes the analysis look just like the SS sampling
case. See Wooldridge (1999b) for details.]
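
To make the SS weights and the middle matrix (20.14) concrete, consider estimating a population mean, so that $q(w, \mu) = (w - \mu)^2$ with score $-2(w - \mu)$ and Hessian 2. The sketch below is only an illustration: the two strata, their draws, and the population shares $Q_j$ are made up, and the helper name is ours. It returns the weighted mean together with two variance estimates, one that subtracts the within-stratum average score as in (20.14) and one that does not.

```python
def ss_weighted_mean(samples, pop_shares):
    """Weighted mean under SS sampling, with two variance estimates.

    samples    : dict {j: list of draws from stratum j}
    pop_shares : dict {j: Q_j}, assumed known population frequencies
    """
    n_total = sum(len(v) for v in samples.values())
    # Weights Q_j / H_j, where H_j = N_j / N is the sample share of stratum j.
    weight = {j: pop_shares[j] / (len(v) / n_total) for j, v in samples.items()}
    num = sum(weight[j] * w for j, v in samples.items() for w in v)
    den = sum(weight[j] * len(v) for j, v in samples.items())
    mu = num / den

    hess = 2.0 * den          # outer matrix: weighted sum of Hessians
    mid_strat = 0.0           # middle matrix (20.14), demeaned within strata
    mid_plain = 0.0           # same weights, scores not demeaned
    for j, v in samples.items():
        scores = [-2.0 * (w - mu) for w in v]
        sbar = sum(scores) / len(scores)
        mid_strat += weight[j] ** 2 * sum((s - sbar) ** 2 for s in scores)
        mid_plain += weight[j] ** 2 * sum(s ** 2 for s in scores)
    return mu, mid_strat / hess ** 2, mid_plain / hess ** 2


# Hypothetical strata: four draws from "low", two from "high".
samples = {"low": [8.0, 10.0, 12.0, 10.0], "high": [40.0, 60.0]}
Q = {"low": 0.3, "high": 0.7}

mu_hat, var_strat, var_plain = ss_weighted_mean(samples, Q)
```

Here the weighted mean equals $Q_1 \bar w_1 + Q_2 \bar w_2 = 38$, and the variance estimate that ignores the strata is larger than the one based on (20.14), because the stratum means differ from the overall mean.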

If in Example 20.2 we have standard stratified sampling rather than VP sampling,
the weighted MLE is typically called the weighted exogenous sample MLE
(WESMLE); this estimator was suggested by Manski and Lerman (1977) in the
context of choice-based sampling in discrete response models. (Actually, Manski and
Lerman (1977) use multinomial sampling where $H_j$ is the probability of picking
stratum $j$. But Cosslett (1981) showed that a more efficient estimator is obtained by
using $N_j/N$, as one always does in the case of SS sampling; see Wooldridge (1999b)
for an extension of Cosslett’s result to the M-estimator case.)

Provided that the sampling weights $Q_{j_i}/H_{j_i}$ or $p_{j_i}^{-1}$ are given (along with the stratum),
analysis with the weighted M-estimator under SS or VP sampling is fairly
straightforward, but it is not likely to be efficient. In the conditional maximum likelihood
case it is certainly possible to do better. See Imbens and Lancaster (1996) for a
careful treatment.


20.2.3 Stratification Based on Exogenous Variables 


When $w$ partitions as $(x, y)$, where $x$ is exogenous in a sense to be made precise, and
stratification is based entirely on $x$, the standard unweighted estimator on the stratified
sample is consistent and asymptotically normal. The sense in which $x$ must be
exogenous is that $\theta_o$ solves

$$\min_{\theta \in \Theta} E[q(w, \theta) \mid x] \tag{20.15}$$

for each possible outcome $x$. This assumption holds in a variety of contexts with
conditioning variables and correctly specified models. For example, as we discussed
in Chapter 12, the condition holds for nonlinear regression when the conditional
mean is correctly specified and $\theta_o$ is the vector of conditional mean parameters; in
Chapter 13 we showed that this holds for conditional maximum likelihood when the
density of $y$ given $x$ is correct. It also holds in other cases, including the quasi-maximum
likelihood estimators we discussed in Chapter 18 when the conditional
mean is correctly specified. One interesting point, which we will rely on in our
treatment of estimating average treatment effects in Chapter 21, is that, in the linear
case, it is not enough for $u$ to be uncorrelated with $x$. If we want to estimate the
linear projection of $y$ on $x$, we generally need to use the weighted estimator, even if
stratification is a function of $x$.

In the case of VP sampling, a common form of exogenous stratification occurs
when the strata are defined in terms of $x$, and that is the case we treat here. (See
Section 19.8 or Wooldridge (2007) for more general situations.) Then, again letting
$s_i = h_{i1} z_{i1} + \cdots + h_{iJ} z_{iJ}$ be the selection indicator, where each $z_{ij}$ is a function of $x_i$,
$P(s_i = 1 \mid w_i, x_i) = P(s_i = 1 \mid x_i)$. We can immediately apply the results of Section 19.8
to conclude that the unweighted M-estimator is consistent.

A direct proof is also informative. The unweighted M-estimator, using the stratified
sample, $\hat\theta_u$, minimizes

$$\sum_{i=1}^{N} s_i q(w_i, \theta) = \sum_{i=1}^{N} \sum_{j=1}^{J} h_{ij} z_{ij} q(w_i, \theta), \tag{20.16}$$

and consistency generally follows if we can show that the population value, $\theta_o$,
uniquely minimizes

$$\sum_{j=1}^{J} E[h_{ij} z_{ij} q(w_i, \theta)] = \sum_{j=1}^{J} p_j E[z_{ij} q(w_i, \theta)], \tag{20.17}$$

where the equality follows because $h_{ij}$ is independent of $(z_{ij}, w_i)$ by the nature of VP
sampling, and $p_j = E(h_{ij})$. Now, because $z_{ij}$ is a function of $x_i$, it follows by iterated
expectations that

$$E[z_{ij} q(w_i, \theta)] = E\{E[z_{ij} q(w_i, \theta) \mid x_i]\} = E\{z_{ij} E[q(w_i, \theta) \mid x_i]\}. \tag{20.18}$$

By assumption, $\theta_o$ minimizes $E[q(w_i, \theta) \mid x_i]$, and, because $z_{ij}$ is a zero-one variable, $\theta_o$
is also a minimizer of $E[z_{ij} q(w_i, \theta)]$. Now, with $p_j > 0$ for all $j$, $\theta_o$ is also a solution to

$$\min_{\theta \in \Theta} \sum_{j=1}^{J} E[h_{ij} z_{ij} q(w_i, \theta)], \tag{20.19}$$

which is what we wanted to show.

Unlike in the case of the weighted estimator, uniqueness of $\theta_o$ in the population is
no longer sufficient for identification using the unweighted estimator. In particular, if
$p_j = 0$ for some $j$, part of the population is not sampled at all, and this may (but need
not) result in lack of identification of $\theta_o$. As discussed in Wooldridge (2001) for the
case of SS sampling, $p_j > 0$ for all $j$ ensures identification of $\theta_o$ when it is identified in
the population.

Generally, when stratification is based on x, one can make a case for weighting if 
interest lies in the solution to the population problem 


$$\min_{\theta \in \Theta} E[q(w, \theta)], \tag{20.20}$$

which we have called $\theta_o$. The IPW estimator consistently estimates $\theta_o$ without further
assumptions, while the unweighted estimator requires the stronger assumption 
described surrounding equation (20.15). A special case is the linear regression model 
discussed in Example 20.1: to consistently estimate the linear projection, we must use 
weights even if selection is based on x. Consistency of the unweighted estimator 
requires that we are estimating the conditional mean. In the next chapter, we will see 
other uses of this fact about the weighted versus unweighted estimator. 
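
The distinction between the linear projection and the conditional mean can be seen in a tiny population calculation. The sketch below does not simulate data; it computes the minimizers of the population (probability-limit) objective functions on an assumed three-point support for $x$, with strata defined by $x$ and assumed retention probabilities. Everything numerical here, including the helper `wls_limit`, is hypothetical and chosen only to make the arithmetic transparent.

```python
import numpy as np


def wls_limit(x, y, mass):
    """Minimizer over (a, b) of sum_k mass_k * (y_k - a - b * x_k)^2."""
    X = np.column_stack([np.ones_like(x), x])
    XtWX = X.T @ (mass[:, None] * X)
    XtWy = X.T @ (mass * y)
    return np.linalg.solve(XtWX, XtWy)


x = np.array([0.0, 1.0, 2.0])            # support of x; each point is a stratum
pop = np.array([1 / 3, 1 / 3, 1 / 3])    # population probabilities of x
keep = np.array([0.9, 0.5, 0.1])         # assumed retention probabilities p_j

# Case 1: E(y | x) = x^2 is not linear, so (a, b) is only a linear projection.
y_nonlinear = x ** 2
proj = wls_limit(x, y_nonlinear, pop)                   # population projection
unweighted = wls_limit(x, y_nonlinear, pop * keep)      # limit without weights
weighted = wls_limit(x, y_nonlinear, pop * keep / keep)  # limit with IPW

# Case 2: E(y | x) = 1 + 2x is linear, so weighting is not needed.
y_linear = 1.0 + 2.0 * x
unweighted_lin = wls_limit(x, y_linear, pop * keep)
```

In the first case the unweighted limit differs from the population linear projection $(-1/3, 2)$ while the IPW limit reproduces it; in the second case the unweighted limit is already $(1, 2)$, in line with the condition surrounding equation (20.15).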

Wooldridge (1999b) shows that the usual asymptotic variance estimators (see Sec- 
tion 12.5) are valid when stratification is based on x and we ignore the stratification 
problem. For example, the usual conditional maximum likelihood analysis holds. In
the case of regression, we can use the usual heteroskedasticity-robust variance matrix
estimator. Or, if we assume homoskedasticity in the population, the nonrobust form 
(see equation (12.58)) is valid with the usual estimator of the error variance. 

When a generalized conditional information matrix equality holds, and stratification
is based on $x$, Wooldridge (1999b) shows that the unweighted estimator is more
efficient than the weighted estimator. The key assumption is

$$E[\nabla_\theta q(w, \theta_o)' \nabla_\theta q(w, \theta_o) \mid x] = \sigma_o^2 E[\nabla_\theta^2 q(w, \theta_o) \mid x] \tag{20.21}$$

for some $\sigma_o^2 > 0$. When assumption (20.21) holds and $\theta_o$ solves equation (20.19), the
asymptotic variance of the unweighted M-estimator is smaller than that for the
weighted M-estimator. This generalization includes conditional maximum likelihood
(with $\sigma_o^2 = 1$) and nonlinear regression under homoskedasticity.

Very similar conclusions hold for standard stratified sampling. One useful fact is
that, when stratification is based on $x$, the estimator (20.8) is valid with $p_j = H_j/Q_j$
(and $N_0 = N$); therefore, we need not compute within-strata variation in the estimated
score. The unweighted estimator is consistent when stratification is based on
x and the usual asymptotic variance matrix estimators are valid. The unweighted 
estimator is also more efficient when assumption (20.21) holds. See Wooldridge 
(2001) for statements of assumptions and proofs of theorems. 

As a practical matter, modern statistical packages that have built-in features for 
analyzing data from stratified samples typically ask for two pieces of information: the 
sampling weights and the stratum identifier. If one specifies the weights but not the 
stratum identifier, the middle of the “sandwich” will not be estimated as in equation 
(20.14). The within-stratum averages will not be subtracted off, resulting in larger 
estimated asymptotic variances than necessary (except in the case of exogenous 
sampling under exogeneity of x). In other words, the resulting confidence intervals 
and inference will be (asymptotically) conservative. It is better to use the information 
on the strata along with the sampling weights. 


20.3 Cluster Sampling 


We now turn to the problem of cluster sampling, where individual units are sampled 
in groups or clusters. As mentioned in the introduction, the problems of cluster 
sampling and panel data analysis are similar in their statistical structures: each con- 
fronts the problem of correlation when observations come with a natural nesting. The 
similarities are strongest in the case where a large number of clusters, each relatively 
small, is drawn from a large population of clusters. This case is relatively easy to 
handle, and we treat it first in Section 20.3.1.

Many data sets have both a panel data and cluster sampling structure. Inference
that is robust to serial correlation and cluster correlation is straightforward provided 
the number of clusters is large. We discuss this case in Section 20.3.2. 

Section 20.3.3 summarizes what is known about applying the usual cluster for- 
mulas when the cluster sizes are large rather than “small.” 

Recently, researchers have studied data structures that can be classified as a small 
number of clusters and many observations per cluster. Section 20.3.4 provides two 
methods for analyzing such structures. 


20.3.1 Inference with a Large Number of Clusters and Small Cluster Sizes 


We begin with the problem of estimating linear and nonlinear models when we can 
sample a large number of clusters from a large population of clusters. For each group 
or cluster $g$, let $\{(y_{gm}, x_g, z_{gm}): m = 1, \ldots, M_g\}$ be the observable data, where $M_g$ is
the number of units in cluster $g$, $y_{gm}$ is a scalar response, $x_g$ is a $1 \times K$ vector containing
explanatory variables that vary only at the cluster level, and $z_{gm}$ is a $1 \times L$
vector of covariates that vary within (as well as across) groups. In most applications
of cluster samples, at least some covariates change only at the group level; earlier we 
gave the example of teacher characteristics when each cluster is a classroom. In fact, 
it is probably a sensible rule to at least consider the data as being generated as a 
cluster sample whenever covariates at a level more aggregated than the individual 
units are included in an analysis. For example, in analyzing firm-level data, if 
industry-level covariates are included then we should treat the data as a cluster sam- 
ple, with each industry acting as a cluster. 

Throughout we assume that the sampling scheme generates observations that are 
independent across g. In other words, we independently draw G clusters from the 
population of all clusters. This assumption can be restrictive, particularly when the 
clusters are large geographical units. Nevertheless, in some cases we can define 
the clusters to allow additional “spatial correlation.” For example, if originally we 
think of sampling fourth-grade classrooms, but then we are worried about correlation 
in student performance not just within class but also within school, then we can de- 
fine the clusters to be the schools. What we will not cover is schemes where any two 
geographical units are allowed to be correlated, with correlation diminishing as the 
observations are farther apart in space. 

The theory with $G \to \infty$ and the group sizes, $M_g$, fixed is well developed; see, for
example, White (1984) and Arellano (1987). Of course, it is up to the researcher to
decide whether the sizes of $G$ and the $M_g$ are suitable for this asymptotic framework.
Here, we follow Wooldridge (2003a) in summarizing these results and emphasizing how
one might have to use robust inference methods even when the need is not so obvious.

Not surprisingly, linear models are easiest to analyze. The standard linear model
with an additive error is

$$y_{gm} = \alpha + x_g \beta + z_{gm} \gamma + v_{gm}, \qquad m = 1, \ldots, M_g; \; g = 1, \ldots, G. \tag{20.22}$$


As with panel data, our approach to estimation and inference in equation (20.22)
depends on several factors, including whether we are interested in the effects of
aggregate variables ($\beta$) or individual-specific variables ($\gamma$). In addition, we need to
make assumptions about the error terms. An important issue is whether the $v_{gm}$
contain a common group effect that can be separated in an additive fashion, as in

$$v_{gm} = c_g + u_{gm}, \qquad m = 1, \ldots, M_g, \tag{20.23}$$

where $c_g$ is an unobserved cluster effect and $u_{gm}$ is the idiosyncratic error. Another
important issue is whether the explanatory variables in equation (20.22) can be taken
to be appropriately exogenous. If the covariates satisfy

$$E(v_{gm} \mid x_g, z_{gm}) = 0, \qquad m = 1, \ldots, M_g; \; g = 1, \ldots, G, \tag{20.24}$$

or even a zero-correlation version, the pooled OLS estimator, where we regress $y_{gm}$
on $1, x_g, z_{gm}$, $m = 1, \ldots, M_g$; $g = 1, \ldots, G$, is consistent for $\theta \equiv (\alpha, \beta', \gamma')'$ as $G \to \infty$
with $M_g$ fixed. Further, the POLS estimator is $\sqrt{G}$-asymptotically normal.

Without more assumptions, a robust variance matrix is needed to account for
correlation within clusters or heteroskedasticity in $\mathrm{Var}(v_{gm} \mid x_g, z_{gm})$, or both. When
$v_{gm}$ has the form in equation (20.23), the amount of within-cluster correlation can be
substantial, with the result that the usual OLS standard errors can be very misleading
(and, in most cases, systematically too small). Write $W_g$ as the $M_g \times (1 + K + L)$
matrix of all regressors for group $g$. Then the $(1 + K + L) \times (1 + K + L)$ variance
matrix estimator is

$$\widehat{\mathrm{Avar}}(\hat\theta_{POLS}) = \left( \sum_{g=1}^{G} W_g' W_g \right)^{-1} \left( \sum_{g=1}^{G} W_g' \hat v_g \hat v_g' W_g \right) \left( \sum_{g=1}^{G} W_g' W_g \right)^{-1}, \tag{20.25}$$

where $\hat v_g$ is the $M_g \times 1$ vector of pooled OLS residuals for group $g$. As we discussed
for the panel data case, this “sandwich” variance matrix estimator is now computed
routinely using “cluster” options in popular statistical packages. One simply needs to
specify the cluster identifier.
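
For readers who want to see what a “cluster” option computes, the following NumPy sketch implements (20.25) literally. It is a bare illustration on simulated data: the number of clusters, the cluster size, and the coefficient values are assumptions, `pols_cluster` is our own helper, and the finite-sample adjustments that packages often apply are omitted.

```python
import numpy as np


def pols_cluster(W, y, cluster_ids):
    """Pooled OLS with the cluster-robust sandwich estimator in (20.25)."""
    W = np.asarray(W, dtype=float)
    y = np.asarray(y, dtype=float)
    cluster_ids = np.asarray(cluster_ids)
    bread = np.linalg.inv(W.T @ W)              # (sum_g W_g' W_g)^{-1}
    theta = bread @ (W.T @ y)
    resid = y - W @ theta
    meat = np.zeros((W.shape[1], W.shape[1]))
    for g in np.unique(cluster_ids):
        rows = cluster_ids == g
        score_g = W[rows].T @ resid[rows]       # W_g' v_g-hat
        meat += np.outer(score_g, score_g)
    return theta, bread @ meat @ bread


rng = np.random.default_rng(1)
G, M = 200, 5                                   # assumed: many small clusters
ids = np.repeat(np.arange(G), M)
x_g = np.repeat(rng.normal(size=G), M)          # varies only across clusters
z_gm = rng.normal(size=G * M)                   # varies within clusters
c_g = np.repeat(rng.normal(size=G), M)          # unobserved cluster effect
y = 1.0 + 0.5 * x_g + 0.5 * z_gm + c_g + rng.normal(size=G * M)
W = np.column_stack([np.ones(G * M), x_g, z_gm])

theta, avar = pols_cluster(W, y, ids)
se_cluster = np.sqrt(np.diag(avar))

# Usual (nonrobust) OLS standard errors for comparison.
resid = y - W @ theta
sigma2 = resid @ resid / (len(y) - W.shape[1])
se_usual = np.sqrt(np.diag(sigma2 * np.linalg.inv(W.T @ W)))
```

With a cluster effect in the error, the cluster-robust standard error on the cluster-level regressor should be noticeably larger than the usual one. If every observation is placed in its own cluster, the same function reduces to the heteroskedasticity-robust estimator.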

Pooled OLS estimation of the parameters in equation (20.22) ignores the within-cluster
correlation of the $v_{gm}$ in estimation, so that it can be very inefficient if $c_g$ is a
part of the error $v_{gm}$. As we know from panel data analysis, if we strengthen the
exogeneity assumption to

$$E(v_{gm} \mid x_g, Z_g) = 0, \qquad m = 1, \ldots, M_g; \; g = 1, \ldots, G, \tag{20.26}$$

where $Z_g$ is the $M_g \times L$ matrix of unit-specific covariates, then we can exploit the
presence of $c_g$ in equation (20.23) in a generalized least squares (GLS) analysis.
Assumption (20.26) rules out covariates from one member of the cluster affecting
the outcomes on another, holding own covariates fixed. At least nominally, this
assumption appears to rule out “peer effects,” but such effects can be allowed by
including measures of peers in $z_{gm}$.

The standard random effects approach makes enough assumptions so that the
$M_g \times M_g$ variance-covariance matrix of $v_g = (v_{g1}, v_{g2}, \ldots, v_{g,M_g})'$ has the “random
effects” form,

$$\mathrm{Var}(v_g) = \sigma_c^2 j_{M_g} j_{M_g}' + \sigma_u^2 I_{M_g}, \tag{20.27}$$

where $j_{M_g}$ is the $M_g \times 1$ vector of ones and $I_{M_g}$ is the $M_g \times M_g$ identity matrix. In the
standard setup, we also make the system homoskedasticity assumption, familiar from
the panel data analysis in Chapter 10:

$$\mathrm{Var}(v_g \mid x_g, Z_g) = \mathrm{Var}(v_g). \tag{20.28}$$

As in the panel data case, it is important to understand the role of assumption
(20.28): it implies that the conditional variance-covariance matrix is the same as the
unconditional variance-covariance matrix, but it does not restrict $\mathrm{Var}(v_g)$; it can be
any $M_g \times M_g$ matrix under assumption (20.28). The particular random effects structure
on $\mathrm{Var}(v_g)$ is given by assumption (20.27). Under assumptions (20.27) and
(20.28), the resulting GLS estimator is the well-known random effects (RE) estimator.
The estimator has the same structure as in the unbalanced panel data case; see
Section 19.9.1.
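
The structure in (20.27) is easy to write out for a given cluster size. The short sketch below builds the matrix for an assumed cluster of three units with hypothetical variance components $\sigma_c^2 = 1$ and $\sigma_u^2 = 3$; the function name is ours.

```python
import numpy as np


def re_variance(m_g, sigma2_c, sigma2_u):
    """Random effects form (20.27): sigma_c^2 * j j' + sigma_u^2 * I."""
    j = np.ones((m_g, 1))                       # M_g x 1 vector of ones
    return sigma2_c * (j @ j.T) + sigma2_u * np.eye(m_g)


omega = re_variance(3, sigma2_c=1.0, sigma2_u=3.0)
# Every pair of units in the cluster has the same correlation,
# sigma_c^2 / (sigma_c^2 + sigma_u^2).
rho = omega[0, 1] / omega[0, 0]
```

Each diagonal element equals $\sigma_c^2 + \sigma_u^2$ and each off-diagonal element equals $\sigma_c^2$, so the implied within-cluster correlation here is 0.25 regardless of which two units are compared.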

The random effects estimator $\hat\theta_{RE}$ is asymptotically more efficient than pooled OLS
under assumptions (20.26), (20.27), and (20.28) as $G \to \infty$ with the $M_g$ fixed. The RE
estimates and test statistics are computed routinely by popular software packages for
cluster samples. Nevertheless, an important point is often overlooked in applications
of RE: one can, and in many cases should, make inference completely robust to an
unknown form of $\mathrm{Var}(v_g \mid x_g, Z_g)$.

The idea in obtaining a fully robust variance matrix for RE is straightforward, as
we saw in Chapter 10 for panel data. Even if $\mathrm{Var}(v_g \mid x_g, Z_g)$ does not have the RE
form, the RE estimator is still consistent and $\sqrt{G}$-asymptotically normal under
assumption (20.26), and it is likely to be more efficient than pooled OLS even if
$\mathrm{Var}(v_g \mid x_g, Z_g)$ does not have the RE form. The case for a fully robust variance
matrix for RE is somewhat more subtle than in the panel data case, where serial
correlation in the idiosyncratic errors generally invalidates assumption (20.27). Of
course, heteroskedasticity in $\mathrm{Var}(c_g \mid x_g, Z_g)$ or $\mathrm{Var}(u_{gm} \mid x_g, Z_g)$ is always a
possibility, and either justifies robust inference. As an example, suppose that the
coefficients on $z_{gm}$ vary at the cluster level:


$$y_{gm} = \alpha + x_g\beta + z_{gm}\gamma_g + v_{gm}, \quad m = 1, \ldots, M_g; \; g = 1, \ldots, G. \tag{20.29}$$


By estimating a standard random effects model that assumes common slopes $\gamma$, we
effectively include $z_{gm}(\gamma_g - \gamma)$ in the idiosyncratic error; doing so generally creates
within-group correlation because $z_{gm}(\gamma_g - \gamma)$ and $z_{gp}(\gamma_g - \gamma)$ will be correlated for
$m \neq p$, conditional on $Z_g$. Also, the idiosyncratic error will have heteroskedasticity
that is a function of $z_{gm}$. Nevertheless, if we assume $E(\gamma_g \mid x_g, Z_g) = E(\gamma_g) \equiv \gamma$ along
with assumption (20.26), the random effects estimator still consistently estimates the
average slopes, $\gamma$ (and $\beta$). Therefore, in applying random effects to panel data or
cluster samples, it is sensible (with large $G$) to make the variance estimator of random
effects robust to arbitrary heteroskedasticity and within-group correlation.

In applications, one often computes the POLS and RE estimates to see how sensi- 
tive the estimates are to choice of variance matrix. Further, one is tempted to com- 
pare estimated variance matrices—or, at least, standard errors—to see if RE is more 
efficient than POLS. It is fine to do so provided one uses fully robust standard errors 
for POLS and RE. For example, it certainly makes no sense to compare the usual 
POLS standard errors (which ignore the cluster sampling) with the usual RE stan- 
dard errors (which account for the clustering, at least to some extent). By comparing 
the fully robust forms for each set of estimates, one is comparing generally reliable 
estimates of the sampling variation of the POLS and RE estimates. 

If we are only interested in estimating $\gamma$, the fixed effects (FE) or within estimator is
attractive. The within transformation subtracts within-group averages from the
dependent variable and explanatory variables:


$$y_{gm} - \bar{y}_g = (z_{gm} - \bar{z}_g)\gamma + u_{gm} - \bar{u}_g, \quad m = 1, \ldots, M_g; \; g = 1, \ldots, G, \tag{20.30}$$


and this equation is estimated by pooled OLS. (Of course, the $x_g$ get swept away by
the within-group demeaning.) Under a full set of FE assumptions (which, as in the
panel data case, allows arbitrary correlation between $c_g$ and the $z_{gm}$), inference is
straightforward using standard software. Nevertheless, analogous to the random
effects case, it is prudent to allow $\mathrm{Var}(u_g \mid Z_g)$ to have an arbitrary form, including
within-group correlation and heteroskedasticity. For example, if we start with model
(20.29), then $(z_{gm} - \bar{z}_g)(\gamma_g - \gamma)$ appears in the error term. As we discussed in Section
11.7.3, the FE estimator is still consistent if $E(\gamma_g \mid z_{gm} - \bar{z}_g) = E(\gamma_g) = \gamma$, an
assumption that allows $\gamma_g$ to be correlated with $\bar{z}_g$. Nevertheless, $u_{gm}$ and $u_{gp}$ will be
correlated for $m \neq p$. A fully robust variance matrix estimator is




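$$\widehat{\mathrm{Avar}}(\hat{\gamma}_{FE}) = \left(\sum_{g=1}^{G} \ddot{Z}_g'\ddot{Z}_g\right)^{-1} \left(\sum_{g=1}^{G} \ddot{Z}_g'\hat{\ddot{u}}_g\hat{\ddot{u}}_g'\ddot{Z}_g\right) \left(\sum_{g=1}^{G} \ddot{Z}_g'\ddot{Z}_g\right)^{-1}, \tag{20.31}$$


where $\ddot{Z}_g$ is the matrix of within-group deviations from means and $\hat{\ddot{u}}_g$ is the $M_g \times 1$
vector of fixed effects residuals. This estimator is justified with large-$G$ asymptotics. It
has exactly the same form as the unbalanced panel data case.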
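As an illustration of how the within estimator in (20.30) and the sandwich form in (20.31) fit together, here is a minimal sketch for the special case of a single unit-specific covariate, so that every matrix in (20.31) collapses to a scalar. The function name and the data are hypothetical, and no degrees-of-freedom adjustment is applied.

```python
import math


def within_estimator(clusters):
    """Within (FE) estimate of a scalar gamma and its cluster-robust standard error.

    clusters: list of (z_values, y_values) pairs, one pair per cluster;
    cluster sizes may differ.
    """
    szz = 0.0   # sum over g of Z_g'Z_g (demeaned), a scalar here
    szy = 0.0
    demeaned = []
    for z, y in clusters:
        zbar = sum(z) / len(z)
        ybar = sum(y) / len(y)
        zd = [zi - zbar for zi in z]
        yd = [yi - ybar for yi in y]
        szz += sum(a * a for a in zd)
        szy += sum(a * b for a, b in zip(zd, yd))
        demeaned.append((zd, yd))
    gamma_hat = szy / szz

    # Middle term of (20.31): squared cluster-level score, Z_g' u_g u_g' Z_g.
    meat = 0.0
    for zd, yd in demeaned:
        score = sum(a * (b - a * gamma_hat) for a, b in zip(zd, yd))
        meat += score ** 2
    avar = meat / szz ** 2
    return gamma_hat, math.sqrt(avar)


# Hypothetical unbalanced cluster sample.
sample = [
    ([1.0, 2.0, 3.0], [2.1, 3.9, 6.2]),
    ([0.0, 4.0], [1.0, 8.5]),
    ([2.0, 5.0, 6.0, 7.0], [3.0, 9.4, 11.1, 13.2]),
]
gamma_hat, robust_se = within_estimator(sample)
print(gamma_hat, robust_se)
```

Because the residual products are summed within each cluster before squaring, arbitrary correlation among the units of a cluster enters the variance estimate; with only three clusters, as here, the large-$G$ justification obviously does not apply, and the numbers serve only to show the mechanics.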

One benefit of a fixed effects approach in the standard model with constant slopes
but $c_g$ in the composite error term is that no adjustments are necessary if the $c_g$ are
correlated across groups. When the groups represent different geographical units, we
might expect correlation across groups close to each other. If we think such
correlation is largely captured through the unobserved effect $c_g$, then its elimination by
means of the within transformation effectively solves the problem. If we use pooled
OLS or random effects, we would have to deal with spatial correlation across $g$, in
addition to within-group correlation, a difficult statistical problem.

An alternative to FE estimation, and one that leads to a simple Hausman test for
comparing FE and RE, is to add the group averages to an RE estimation. Let $\bar{z}_g$
denote the vector of within-group averages, and write


$$y_{gm} = \alpha + x_g\beta + z_{gm}\gamma + \bar{z}_g\xi + a_g + u_{gm}, \quad m = 1, \ldots, M_g; \; g = 1, \ldots, G, \tag{20.32}$$


where $c_g = \bar{z}_g\xi + a_g$ (and we absorb the intercept here into $\alpha$). Estimating this
equation by, say, RE allows us to easily test $H_0\colon \xi = 0$ in a fully robust way, which tests
the null that the RE estimator is consistent. It can be shown that, even though the
panel is not balanced, the estimate of $\gamma$ is the FE estimate. In addition, this approach
allows us to estimate coefficients on $x_g$. (Pooled OLS can also be used, and also
delivers the FE estimate of $\gamma$.)
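The claim that pooled OLS on equation (20.32) reproduces the FE estimate of $\gamma$, even with unequal group sizes, is easy to check numerically. The sketch below does so for a single covariate and no group-level $x_g$; the small linear solver, the function names, and the data are all assumptions made for illustration.

```python
def solve(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    aug = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(aug[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (aug[i][n] - tail) / aug[i][i]
    return x


def ols(rows, y):
    """Pooled OLS coefficients from the normal equations X'X b = X'y."""
    k = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * yi for r, yi in zip(rows, y)) for i in range(k)]
    return solve(xtx, xty)


# Hypothetical unbalanced clusters: (z values, y values).
clusters = [
    ([1.0, 2.0, 4.0], [3.1, 4.9, 9.2]),
    ([0.0, 3.0], [1.0, 6.5]),
    ([2.0, 2.5, 3.5, 6.0], [2.2, 3.9, 5.1, 11.0]),
]

rows, y_all, num, den = [], [], 0.0, 0.0
for z, y in clusters:
    zbar, ybar = sum(z) / len(z), sum(y) / len(y)
    for zi, yi in zip(z, y):
        rows.append([1.0, zi, zbar])          # regressors: constant, z_gm, zbar_g
        y_all.append(yi)
        num += (zi - zbar) * (yi - ybar)      # pieces of the within estimator
        den += (zi - zbar) ** 2

gamma_fe = num / den
alpha_hat, gamma_aug, xi_hat = ols(rows, y_all)
print(gamma_fe, gamma_aug, xi_hat)
```

The two estimates of $\gamma$ agree up to rounding error, while the coefficient on $\bar{z}_g$ is what the regression-based Hausman test examines.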


Example 20.3 (Cluster Correlation in Teacher Compensation): The data set in 
BENEFITS.RAW includes average compensation, at the school level, for teachers in 
Michigan. Interest lies in testing for a trade-off between salary and nonsalary com- 
pensation. We view this as a cluster sample of school districts, with the schools within 
districts representing the individual units. 

A standard approach is to estimate the equation 


$$\log(avgsal_{gm}) = \alpha + \beta_1 bs_{gm} + \beta_2 \log(staff_{gm}) + \beta_3 \log(enroll_{gm}) + \beta_4 lunch_{gm} + c_g + u_{gm}, \tag{20.33}$$


where $avgsal_{gm}$ is the average salary for school $m$ in district $g$, $bs_{gm} = avgben_{gm}/avgsal_{gm}$,
where $avgben_{gm}$ is the average benefits received by teachers, $staff_{gm}$ is the
number of staff per 1,000 students, $enroll_{gm}$ is school enrollment, and $lunch_{gm}$ is the
percentage of students eligible for the federal free or reduced-price lunch program.
Using the approximation $\log(1 + x) \approx x$ for “small” $x$, it can be shown that a
dollar-for-dollar trade-off in salary and benefits is the same as $\beta_1 = -1$.

Table 20.1
Salary-Benefits Trade-off for Michigan Teachers

Dependent Variable: log(avgsal)

                           (1)             (2)               (3)
Estimation Method       Pooled OLS     Random Effects    Fixed Effects

Explanatory Variable

bs                       -0.177          -0.381            -0.495
                         (0.122)         (0.112)           (0.133)
                         [0.260]         [0.150]           [0.194]

log(staff)               -0.691          -0.617            -0.622
                         (0.018)         (0.015)           (0.017)
                         [0.035]         [0.036]           [0.043]

log(enroll)              -0.0292         -0.0249           -0.0515
                         (0.0085)        (0.0076)          (0.0094)
                         [0.0257]        [0.0115]          [0.0131]

lunch                    -0.00085         0.00030           0.00051
                         (0.00016)       (0.00018)         (0.00021)
                         [0.00057]       [0.00020]         [0.00021]

constant                 13.724          13.367            13.618
                         (0.112)         (0.098)           (0.113)
                         [0.256]         [0.197]           [0.241]

Number of districts       537             537               537

Number of schools         1,848           1,848             1,848

Quantities in parentheses are the nonrobust standard errors; those in brackets are robust to arbitrary
within-district correlation as well as heteroskedasticity.

The intercept reported for fixed effects is the average of the estimated district effects.

The fully robust regression-based Hausman test, with four degrees of freedom in the chi-square
distribution, yields H = 20.70 and p-value = 0.0004.

We estimate the equation using three methods: pooled OLS, random effects, and
fixed effects. The results are given in Table 20.1. The table contains the nonrobust
standard errors for each method (that is, the standard errors computed under the
“ideal” set of assumptions for the particular estimator) along with the standard
errors that are robust to arbitrary within-district correlation and heteroskedasticity.

The POLS estimates provide little evidence of a trade-off between salary and
benefits. The coefficient is negative, but its value, $-0.177$, is pretty small, and not close to
$-1$ (the hypothesized value for a one-for-one trade-off between salary and benefits).
Its fully robust $t$ statistic is less than 0.7 in magnitude. Notice that the robust
standard error, which properly accounts for the cluster nature of the data, is more than
twice as large as the nonrobust one.

The magnitude of the random effects coefficient on bs is notably larger than the
pooled OLS estimate, and it is statistically different from zero, even using the fully
robust standard error. The RE transformation removes a fraction of the district
averages. (The fraction depends on the number of schools in a district, and it ranges
from about 0.379 to 0.938, with more than 50% of the districts at 0.379.) Even
though RE nominally accounts for the cluster (district) effect, the nonrobust standard
errors evidently understate the actual sampling variation. The robust 95% confidence
interval excludes zero, but it also excludes $-1$ (and the RE point estimate, $-0.381$, is
far from $-1$). The robust standard error on log(staff), 0.036, is more than twice as
large as the nonrobust one, 0.015. Again, this finding points to the importance of
using robust inference even if we nominally account for the common district effect by
means of random effects estimation. Incidentally, the RE robust standard errors are,
except in one case, smaller than the robust pooled OLS standard errors, indicating
that RE is more efficient than POLS even though RE is evidently not the most
efficient estimator (because it appears there is a more complicated pattern of cluster
correlation than accounted for by RE).
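The dependence of the RE quasi-demeaning fraction on the number of schools in a district can be sketched with the standard unbalanced-RE expression $\theta_g = 1 - [\sigma_u^2/(\sigma_u^2 + M_g\sigma_c^2)]^{1/2}$. The variance components used below are purely hypothetical; they are not the estimates from BENEFITS.RAW, so the printed fractions are not meant to reproduce the 0.379 to 0.938 range.

```python
import math


def re_demeaning_fraction(m_g, sigma_c2, sigma_u2):
    """Fraction of the group average removed by the RE transformation
    for a group with m_g units."""
    return 1.0 - math.sqrt(sigma_u2 / (sigma_u2 + m_g * sigma_c2))


# Hypothetical variance components; only the pattern in m_g matters here.
for m_g in (1, 2, 5, 20, 100):
    print(m_g, round(re_demeaning_fraction(m_g, sigma_c2=1.0, sigma_u2=1.0), 3))
```

The fraction rises toward one as the group gets larger, so RE behaves more like FE in large districts and more like pooled OLS in districts with a single school.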

Column (3) in Table 20.1 contains the fixed effects estimates. The coefficient on bs 
is about —0.50, which is still pretty far from —1, and statistically different from —1 
even using the fully robust standard error. Again, allowing for clustering and hetero- 
skedasticity is important for appropriate inference: the usual FE standard errors 
appear to be too small. Because total compensation varies significantly by district, it 
is important to allow the district effects to be correlated with the explanatory vari- 
ables, as FE does. 

Not surprisingly, the fully robust RE standard errors are somewhat below the fully 
robust FE standard errors, a result which makes it tempting to use the RE estimates. 
But the robust Hausman test, obtained by adding the four group averages to the RE 
estimation and testing their joint significance, yields a low p-value, about 0.0004. It 
appears the district effect is systematically related to some of the variables (staff size 
especially), and so the safest strategy is to use the fixed effects estimates with fully 
robust inference. 
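The test statistics quoted in the example follow directly from the entries of Table 20.1. The short calculation below reproduces them; the helper name is hypothetical, and the 1.96 critical value assumes the usual large-sample normal approximation.

```python
def t_stat(estimate, std_error, null_value=0.0):
    """t statistic for H0: coefficient = null_value."""
    return (estimate - null_value) / std_error


# Coefficients on bs and their fully robust standard errors from Table 20.1.
t_pols_zero = t_stat(-0.177, 0.260)             # POLS, H0: beta_1 = 0
t_re_zero = t_stat(-0.381, 0.150)               # RE,   H0: beta_1 = 0
t_fe_minus_one = t_stat(-0.495, 0.194, -1.0)    # FE,   H0: beta_1 = -1

# Robust 95% confidence interval for the RE coefficient on bs.
ci_re = (-0.381 - 1.96 * 0.150, -0.381 + 1.96 * 0.150)

print(round(t_pols_zero, 2), round(t_re_zero, 2), round(t_fe_minus_one, 2))
print(tuple(round(bound, 3) for bound in ci_re))
```

The POLS statistic is below 0.7 in absolute value, the RE interval lies strictly between $-1$ and zero, and the FE estimate is more than two and a half robust standard errors away from $-1$, matching the discussion above.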


The discussion of the previous methods extends immediately to instrumental vari- 
ables versions of all estimators. With large G, one can afford to make pooled two- 
stage least squares (2SLS), random effects 2SLS, and fixed effects 2SLS robust to 
arbitrary within-cluster correlation and heteroskedasticity. Adding the group aver- 
ages of the exogenous explanatory variables (including the extra instruments), esti- 
mating the resulting equation by RE 2SLS (where the group averages act as their 
own instruments), and jointly testing the group averages for significance leads to a 
simple Hausman test comparing RE 2SLS and FE 2SLS.

If the random effects variance matrix structure does not hold, more efficient
estimation can be achieved by applying generalized method of moments (GMM); again,
GMM is justified with large G.

As we discussed in Section 12.10.3, one might apply least absolute deviations or 
quantile regression directly to equation (20.32). While difficult to justify in general, 
adding the group averages and then applying, say, LAD, can be a useful way to 
approximate the effects of the variables on the median while allowing the group 
heterogeneity to be correlated with the individual-specific covariates. Under the kinds 
of symmetry assumptions discussed in Section 12.10.3, this can be a good way to 
account for outliers in the data. 

For the case where $G$ is much larger than the group sizes, cluster-robust inference
is available for nonlinear models, too. A general treatment based on M-estimation is
possible, but most of the points can be illustrated with binary response models. Let
$y_{gm}$ be a binary response, with $x_g$ and $z_{gm}$, $m = 1, \ldots, M_g$, $g = 1, \ldots, G$, defined as
before. Assume that


$$y_{gm} = 1[\alpha + x_g\beta + z_{gm}\gamma + c_g + u_{gm} \geq 0], \tag{20.34}$$

where $c_g$ is the cluster effect and $u_{gm}$ is the unit-specific error. If, say, we assume


$$u_{gm} \mid x_g, Z_g, c_g \sim \mathrm{Normal}(0, 1), \tag{20.35}$$


then

$$P(y_{gm} = 1 \mid x_g, z_{gm}, c_g) = P(y_{gm} = 1 \mid x_g, Z_g, c_g) = \Phi(\alpha + x_g\beta + z_{gm}\gamma + c_g), \tag{20.36}$$


where $\Phi(\cdot)$ is the standard normal cdf, as usual. Alternatively, if $u_{gm}$ follows a
logistic distribution, then we replace $\Phi(\cdot)$ with $\Lambda(\cdot)$. Notice that expression (20.35)
assumes that, conditional on $c_g$, $x_g$, and $z_{gm}$, $z_{gp}$ for $p \neq m$ does not affect the
outcome. For pooled methods we could relax this restriction (as in the linear case), but,
with the presence of $c_g$, this affords little generality in practice.

As in nonlinear panel data models, the presence of $c_g$ in equation (20.36) raises
several important issues, including how we estimate quantities of interest. As in the
panel data case, we have some interest in estimating average partial or marginal
effects. For example, if the first element of $x_g$ is continuous,


$$\frac{\partial P(y_{gm} = 1 \mid x_g, z_{gm}, c_g)}{\partial x_{g1}} = \beta_1 \phi(\alpha + x_g\beta + z_{gm}\gamma + c_g), \tag{20.37}$$


where $\phi(\cdot)$ is the standard normal density function. If


$$c_g \mid x_g, Z_g \sim \mathrm{Normal}(0, \sigma_c^2), \tag{20.38}$$

then the APEs are obtained from the average structural function


$$ASF(x_g, z_{gm}) = \Phi[(\alpha + x_g\beta + z_{gm}\gamma)/(1 + \sigma_c^2)^{1/2}] \equiv \Phi(\alpha_c + x_g\beta_c + z_{gm}\gamma_c), \tag{20.39}$$


where $\alpha_c = \alpha/(1 + \sigma_c^2)^{1/2}$ and so on. Because the right-hand side of equation (20.39)
is $P(y_{gm} = 1 \mid x_g, Z_g)$, the scaled coefficients are conveniently estimated using pooled
probit or a generalized estimating equation (GEE) approach. In either case, inference
must be robust to allow general covariance structures $\mathrm{Cov}(y_{gm}, y_{gp} \mid x_g, Z_g)$ for
$m \neq p$. These certainly will not be zero (as would be required to ignore the clustering
using pooled methods), but neither will they be constant. The same formulas used in
the panel data case apply to cluster samples, with the small change that the group
sizes are generally different.
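The scaling result behind (20.39), namely that averaging $\Phi(a + c_g)$ over $c_g \sim \mathrm{Normal}(0, \sigma_c^2)$ gives $\Phi[a/(1 + \sigma_c^2)^{1/2}]$, can be checked numerically. The sketch below compares a simple trapezoid-rule integral with the closed form; the function names, grid settings, and input values are illustrative assumptions.

```python
import math


def norm_cdf(x):
    """Standard normal cdf."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def norm_pdf(x):
    """Standard normal density."""
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)


def asf_by_integration(index, sigma_c, n_grid=4001, width=10.0):
    """Integrate Phi(index + c) against the Normal(0, sigma_c^2) density
    using the trapezoid rule on [-width*sigma_c, width*sigma_c]."""
    lo, hi = -width * sigma_c, width * sigma_c
    step = (hi - lo) / (n_grid - 1)
    total = 0.0
    for i in range(n_grid):
        c = lo + i * step
        weight = 0.5 if i in (0, n_grid - 1) else 1.0
        total += weight * norm_cdf(index + c) * norm_pdf(c / sigma_c) / sigma_c
    return total * step


def asf_closed_form(index, sigma_c):
    """Right-hand side of (20.39) for a given index a = alpha + x*beta + z*gamma."""
    return norm_cdf(index / math.sqrt(1.0 + sigma_c ** 2))


numeric = asf_by_integration(0.7, 1.3)
closed = asf_closed_form(0.7, 1.3)
print(numeric, closed)
```

The agreement illustrates why pooled probit, which ignores $c_g$, still estimates the scaled coefficients that determine the average partial effects.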

The pooled and GEE approaches are attractive because they are computationally
simple and do not require specification of a joint distribution within clusters.
Alternatively, we can impose more assumptions, as in the panel data case, and use full
maximum likelihood (conditional on $x_g$ and $Z_g$, of course). If we supplement
assumptions (20.34), (20.35), and (20.38) with


$$\{u_{g1}, \ldots, u_{g,M_g}\} \text{ are independent conditional on } (x_g, Z_g, c_g), \tag{20.40}$$


then we have the random effects probit model. Details of its estimation are similar to
the panel data case, with the minor exception that here we must allow for an unequal
number of observations per cluster. Because we can separately identify $\alpha$, $\beta$, $\gamma$, and
$\sigma_c^2$, partial effects at various values of $c_g$ are identified in addition to the average
partial effects. In one important way, a random effects approach under the
conditional independence assumption (20.40) is more attractive for cluster samples than
for panel data: in panel data it is often the case that time series innovations are
correlated over time. With a cluster sample, independence of individual outcomes after
conditioning on a common cluster effect is often more believable. (Nevertheless, we
are conditioning only on a scalar heterogeneity, $c_g$, so such an assumption may still
be too restrictive.)

As with panel data, we often want to allow the cluster heterogeneity, $c_g$, to be
correlated with the observed covariates. When the cluster sizes are the same (that is,
$M_g = M$ for all $g$), we can apply the same methods we used for balanced panels,
including probit, logit, ordered probit, Tobit, count data, fractional responses, and
so on. Calculations of average partial effects are identical, and random effects
approaches under conditional independence are attractive. But pooled methods are
often computationally simpler and are sufficient for identifying APEs.

A challenging task for CRE approaches using cluster samples is how to model
correlation between the unobserved heterogeneity and $\{z_{gm}\colon m = 1, \ldots, M_g\}$ when
the clusters are not balanced with respect to cluster size. One possible solution is to
randomly drop observations from clusters to make them all the same size. Then we
could apply the usual approach for balanced panel data. Unfortunately, this
approach can be very costly in terms of lost data. For example, if the smallest group
has $M_g = 3$, we would have to drop observations from all other groups until we only
have three in each group.

The problem with different group sizes is that it is unclear how one should model
the correlation between $c_g$ and $(z_{g1}, \ldots, z_{g,M_g})$ for each $g$. Nevertheless, there are
several possibilities. We can get some insight by assuming joint normality of
$(c_g, z_{g1}, \ldots, z_{g,M_g})$ and then assuming that
$E(c_g \mid z_{g1}, \ldots, z_{g,M_g}) = E(c_g \mid \bar{z}_g) = \eta_g + \bar{z}_g\xi_g$;
that is, assuming $\bar{z}_g$ is a sufficient statistic for the mean. Then it must follow that


$$\xi_g = [\mathrm{Var}(\bar{z}_g)]^{-1}\mathrm{Cov}(\bar{z}_g, c_g)$$


$$\eta_g = E(c_g) - E(\bar{z}_g)\xi_g.$$


For the sake of argument, assume that $\{z_{gm}\colon m = 1, \ldots, M_g\}$ has an unobserved
effects structure, that is, $z_{gm} = r_g + e_{gm}$, where $r_g$ is uncorrelated with each $e_{gm}$ and
$\{e_{gm}\colon m = 1, \ldots, M_g\}$ are pairwise uncorrelated with zero mean and common
variance matrix $\Sigma_e$. Then $E(\bar{z}_g) = \mu_r$ and $\mathrm{Var}(\bar{z}_g) = \Sigma_r + M_g^{-1}\Sigma_e$. Assume that $c_g$ is
uncorrelated with the $e_{gm}$, and let $\sigma_{rc}$ be the vector of covariances of $r_g$ with $c_g$. Then


$$\xi_g = (\Sigma_r + M_g^{-1}\Sigma_e)^{-1}\sigma_{rc} \tag{20.41}$$

$$\eta_g = \mu_c - \mu_r\xi_g, \tag{20.42}$$

where $\mu_c = E(c_g)$. Further, if we write $c_g = \eta_g + \bar{z}_g\xi_g + a_g$,

$$\mathrm{Var}(a_g) = \sigma_c^2 - \sigma_{rc}'(\Sigma_r + M_g^{-1}\Sigma_e)^{-1}\sigma_{rc}. \tag{20.43}$$


These calculations show that, even under fairly strong assumptions, both $E(c_g \mid Z_g)$
and $\mathrm{Var}(c_g \mid Z_g)$ depend on the group size, $M_g$ (and $E(c_g \mid Z_g)$ depends on $\bar{z}_g$, too). If
$\Sigma_r = 0$, the mean and variance depend on $M_g$ and $M_g \cdot \bar{z}_g$. If $\Sigma_r$ is “large” (in a
matrix sense), or $M_g$ is large, the mean and variance are almost independent of $M_g$ (but
the mean is a linear function of $\bar{z}_g$). If $\Sigma_r$ and $\Sigma_e$ are both scalar multiples of the
identity matrix, the function of $M_g$ has the form $(\sigma_r^2 + M_g^{-1}\sigma_e^2)^{-1}$, which, for large
$M_g$, can be approximated well by a low-order polynomial in $M_g^{-1}$.
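To see how strongly the quantities in (20.41) and (20.43) can vary with group size, the following sketch evaluates them for the scalar case (one covariate, so $\Sigma_r$, $\Sigma_e$, and $\sigma_{rc}$ are numbers). The function names and the parameter values are hypothetical and chosen only for illustration.

```python
def cre_mean_slope(m_g, sigma_r2, sigma_e2, sigma_rc):
    """Scalar version of (20.41): slope on zbar_g in E(c_g | zbar_g)."""
    return sigma_rc / (sigma_r2 + sigma_e2 / m_g)


def cre_residual_variance(m_g, sigma_c2, sigma_r2, sigma_e2, sigma_rc):
    """Scalar version of (20.43): Var(a_g) as a function of the group size."""
    return sigma_c2 - sigma_rc ** 2 / (sigma_r2 + sigma_e2 / m_g)


# Hypothetical second moments.
params = dict(sigma_r2=1.0, sigma_e2=2.0, sigma_rc=0.5)
for m_g in (1, 2, 5, 20, 100):
    slope = cre_mean_slope(m_g, **params)
    resid_var = cre_residual_variance(m_g, sigma_c2=1.0, **params)
    print(m_g, round(slope, 4), round(resid_var, 4))
```

In this parameterization the slope rises and the residual variance falls as $M_g$ grows, and both level off once $M_g^{-1}\sigma_e^2$ is small relative to $\sigma_r^2$, which is the pattern the preceding discussion describes.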

How should we apply these calculations for the conditional mean and variance of
$c_g$? First, we should recognize that they are derived under strong assumptions, so we
should not use such specific forms. (Plus, they would not be very easy to handle
computationally.) An approach that may be flexible enough is

$$c_g = \psi_0 + \psi_1 M_g + \bar{z}_g\psi_2 + (M_g \cdot \bar{z}_g)\psi_3 + a_g \tag{20.44}$$

$$\mathrm{Var}(a_g \mid z_{g1}, \ldots, z_{g,M_g}) = \mathrm{Var}(a_g) = \omega_0 + \omega_1 M_g \tag{20.45}$$


(where we expect $\omega_1 < 0$ because the conditional variance of $c_g$ shrinks as the
number of explanatory variables increases). If we use these expressions in place of the
usual Chamberlain-Mundlak approach (and include $x_g$, too), we get the following
estimating equation:


$$P(y_{gm} = 1 \mid x_g, Z_g) = \Phi\left[\frac{\alpha + x_g\beta + z_{gm}\gamma + \psi_1 M_g + \bar{z}_g\psi_2 + (M_g \cdot \bar{z}_g)\psi_3}{\sqrt{1 + \omega_0 + \omega_1 M_g}}\right]. \tag{20.46}$$


In principle, the parameters here can be estimated by, say, a pooled heteroskedastic
probit analysis. To estimate the parameters, a normalization is needed on the
variance (because when $\omega_1 = 0$, only the parameters scaled by $1/\sqrt{1 + \omega_0}$ are identified).
In fact, using modern software that allows for exponential forms of
heteroskedasticity in probit analysis, an easy way to estimate the identified parameters, and
then obtain average partial effects, is to specify the variance as $\exp[\delta_1 \log(M_g)]$.
When $\delta_1 = 0$, the variance is one, and we estimate the scaled coefficients. Notice that
specifying the composite variance as $\exp[\delta_1 \log(M_g)]$ also has the benefit of nesting
the cases where $\mathrm{Var}(a_g)$ is a linear function of $M_g$ or $M_g^{-1}$.

A more flexible approach is to let the conditional variance of $c_g$ be as flexible as the
conditional mean, but still nesting the preceding simple functional form. For
example, a more flexible estimating equation is


$$P(y_{gm} = 1 \mid x_g, Z_g) = \Phi\left[\frac{\alpha + x_g\beta + z_{gm}\gamma + \psi_1 M_g + \bar{z}_g\psi_2 + (M_g \cdot \bar{z}_g)\psi_3}{\exp(\delta_1 \log(M_g) + \bar{z}_g\delta_2 + \log(M_g)\bar{z}_g\delta_3)}\right]. \tag{20.47}$$


We could replace $M_g$ in the mean part with $M_g^{-1}$ or even use both functions. Such
an equation is relatively straightforward to estimate using heteroskedastic probit
software.
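The role of the exponential variance specification can be sketched with a few lines of code. The function below evaluates a response probability when the composite variance is taken to be $\exp[\delta_1\log(M_g)] = M_g^{\delta_1}$, as suggested above for the simple case; the function names and input values are hypothetical, and the mean index is passed in as a single number for brevity.

```python
import math


def norm_cdf(x):
    """Standard normal cdf."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def response_probability(index, m_g, delta1):
    """Phi(index / sd), where the composite variance is exp(delta1 * log(M_g)).

    index stands for the mean part, for example
    alpha + x*beta + z*gamma + psi1*M_g + zbar*psi2 + M_g*zbar*psi3.
    """
    variance = math.exp(delta1 * math.log(m_g))
    return norm_cdf(index / math.sqrt(variance))


# Hypothetical index value evaluated at several group sizes.
for m_g in (1, 4, 16):
    print(m_g,
          round(response_probability(0.4, m_g, delta1=0.0), 4),
          round(response_probability(0.4, m_g, delta1=-0.5), 4))
```

With $\delta_1 = 0$ the variance is one for every group size and the usual probit response probability is recovered; with $\delta_1 \neq 0$ the same index maps into different probabilities for clusters of different sizes.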

A very attractive alternative with large $G$ and not much variation in the group sizes
$M_g$ is to allow a different set of parameters in $D(c_g \mid M_g, \bar{z}_g)$ for each value of $M_g$.
This is easily accomplished by including dummies for all but one group size and also
interacting the dummies with $\bar{z}_g$ in the mean and the variance. In equation (20.43) the
variance depends only on $M_g$, and so one might want to simplify the estimation by
including only the group-size dummies in the variance.

Regardless of the specific expression we use for $P(y_{gm} = 1 \mid x_g, Z_g)$, it is
straightforward to estimate the average partial effects. The conditioning variables that we
must average out are $(M_g, \bar{z}_g)$, and we use, as usual, the discussion in Section 2.2.5.
Let $m(x_g, z_{gm}, M_g, \bar{z}_g, \theta)$ be the response probability $P(y_{gm} = 1 \mid x_g, Z_g)$, where $\theta$ is
the set of all parameters. Then the ASF, for fixed values of $x$ and $z$, is consistently
estimated as


$$\widehat{ASF}(x, z) = G^{-1}\sum_{g=1}^{G} m(x, z, M_g, \bar{z}_g, \hat{\theta}); \tag{20.48}$$


that is, we average out $(M_g, \bar{z}_g)$. (As usual, one must use caution in interpreting the
effects of the group-level variables if these are partially correlated with $c_g$.)
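A minimal sketch of the averaging in (20.48), assuming scalar covariates and the simple exponential variance $\exp[\delta_1\log(M_g)]$, is given next. The parameter dictionary stands in for estimates from a pooled heteroskedastic probit; its values, the function names, and the cluster data are all hypothetical.

```python
import math


def norm_cdf(x):
    """Standard normal cdf."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def response(x, z, m_g, zbar_g, theta):
    """Response probability m(x, z, M_g, zbar_g, theta) with scalar x and z."""
    index = (theta["alpha"] + theta["beta"] * x + theta["gamma"] * z
             + theta["psi1"] * m_g + theta["psi2"] * zbar_g
             + theta["psi3"] * m_g * zbar_g)
    sd = math.exp(0.5 * theta["delta1"] * math.log(m_g))  # sqrt of exp(delta1*log M_g)
    return norm_cdf(index / sd)


def asf_hat(x, z, groups, theta):
    """Average the response probability over the observed (M_g, zbar_g) pairs."""
    return sum(response(x, z, m, zb, theta) for m, zb in groups) / len(groups)


def ape_z(x, z, groups, theta, h=1e-5):
    """Partial effect of z on the ASF by a central finite difference."""
    return (asf_hat(x, z + h, groups, theta) - asf_hat(x, z - h, groups, theta)) / (2 * h)


# Hypothetical "estimates" and cluster-level data (M_g, zbar_g), one pair per cluster.
theta_hat = dict(alpha=-0.2, beta=0.3, gamma=0.8,
                 psi1=0.02, psi2=0.4, psi3=-0.01, delta1=-0.3)
groups = [(2, 0.1), (5, -0.3), (3, 0.6), (8, 0.0)]

print(asf_hat(0.5, 1.0, groups, theta_hat))
print(ape_z(0.5, 1.0, groups, theta_hat))
```

The values of $x$ and $z$ are held fixed while the cluster-specific $(M_g, \bar{z}_g)$ are averaged out, so each cluster contributes once regardless of its size.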

Incidentally, the methods proposed here can be applied to unbalanced panel data
sets, assuming, of course, that the reason the panel is unbalanced can be ignored.
With a large cross section, $N$ (which replaces $G$), and a small number of time periods,
$T_i$ (which replaces $M_g$) for each observation $i$, the flexible approach of allowing
different parameters for each $T_i$ is attractive.

Rather than adopt a correlated random effects probit approach, we can apply the
fixed effects logit approach, assuming that the observations within a cluster are
independent conditional on the observed covariates and the cluster effect, $c_g$.
Naturally, the cluster-level variables, $x_g$, are eliminated, and one can only estimate
parameters, not partial effects. The mechanics are essentially identical to the panel
data case. Geronimus and Korenman (1992) use sister pairs to study the effects of
teenage motherhood on subsequent economic outcomes, so $M_g = 2$ for all $g$. When
the outcome is binary (such as an employment indicator), the authors apply fixed
effects logit. CRE probit can also be used to obtain the magnitudes of the effects.

The same CRE approach can be applied to other nonlinear models, such as
ordered probit and Tobit models. Generally, if we begin with a density
$f(y_{gm} \mid x_g, z_{gm}, c_g; \theta)$, where both $y_{gm}$ and $c_g$ can be vectors, and then specify a
heterogeneity density, say, $h(c_g \mid M_g, \bar{z}_g; \delta)$, a partial MLE analysis can be obtained by
“integrating out” $c_g$ to get the density


$$\int f(y_{gm} \mid x_g, z_{gm}, c; \theta)\, h(c \mid M_g, \bar{z}_g; \delta)\, dc. \tag{20.49}$$


As we know from the panel data case, this density has a simple form for common
models, such as Tobit, when $c_g$ is a scalar and $h(\cdot \mid M_g, \bar{z}_g; \delta)$ is chosen to have a
simple form, such as normal. However, as for the probit case, one should allow
heteroskedasticity in $\mathrm{Var}(c_g \mid M_g, \bar{z}_g)$, leading to a pooled estimation strategy based on
the “heteroskedastic Tobit” model. We also know that using pooled (that is, partial)
MLE does not always fully identify the parameters, but it does often identify scaled
parameters and average partial effects. Because the observations within clusters are
almost certainly correlated, even after conditioning on $(x_g, Z_g)$, inference that allows
within-cluster correlation is crucial.

Various quasi-MLEs can also be adapted to account for correlated random effects 
in the context of cluster sampling. In the exponential case, we would be led to a mean 
function that looks like, say, 


$$E(y_{gm} \mid x_g, Z_g, M_g) = E(y_{gm} \mid x_g, z_{gm}, M_g, \bar{z}_g) = \exp(x_g\beta + z_{gm}\gamma + \alpha_{M_g} + \bar{z}_g\psi_{M_g}), \tag{20.50}$$


where $\alpha_{M_g}$ and $\psi_{M_g}$ are specific to the group size, $M_g$ (or we use linear or low-order
polynomials in $M_g$ or $M_g^{-1}$ or both). The parameters can be estimated by, say,
pooled Poisson QMLE, or GEE using the Poisson distribution by including a full set
of group-size dummies along with $\bar{z}_g$ and interactions with $\bar{z}_g$. The elements of $(\beta, \gamma)$
measure semielasticities or elasticities on the mean response. The APEs on the mean
are obtained by averaging out $(M_g, \bar{z}_g)$. See Problem 20.8 for the case of a fractional
response.

As in the panel data case, the fixed effects Poisson estimator is very convenient
when we start with


$$E(y_{gm} \mid Z_g, c_g) = E(y_{gm} \mid z_{gm}, c_g) = \exp(z_{gm}\gamma + c_g). \tag{20.51}$$


With arbitrarily unbalanced group sizes, the FE Poisson estimator (viewed as a
quasi-MLE) consistently estimates $\gamma$ (and $x_g$ is eliminated). No other feature of the
Poisson distribution needs to be correctly specified, and sources of within-cluster
correlation other than $c_g$ are allowed (provided, of course, we use fully robust
inference).
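The mechanics of the FE Poisson quasi-MLE with unequal group sizes can be sketched for a scalar $\gamma$. The objective used below is the standard conditional (multinomial) quasi-log-likelihood, $\sum_g\sum_m y_{gm}[z_{gm}\gamma - \log\sum_r \exp(z_{gr}\gamma)]$, maximized by Newton's method; the function name, starting value, and data are illustrative assumptions, and no standard errors are computed.

```python
import math


def fe_poisson(clusters, gamma=0.0, tol=1e-12, max_iter=100):
    """FE Poisson quasi-MLE of a scalar gamma by Newton's method.

    clusters: list of (z_values, y_values); y only needs to be nonnegative.
    """
    for _ in range(max_iter):
        score, info = 0.0, 0.0
        for z, y in clusters:
            w = [math.exp(zi * gamma) for zi in z]
            total_w = sum(w)
            p = [wi / total_w for wi in w]            # within-cluster "shares"
            n_g = sum(y)                              # cluster total of y
            mean_z = sum(pi * zi for pi, zi in zip(p, z))
            var_z = sum(pi * (zi - mean_z) ** 2 for pi, zi in zip(p, z))
            score += sum(yi * zi for yi, zi in zip(y, z)) - n_g * mean_z
            info += n_g * var_z
        step = score / info
        gamma += step
        if abs(step) < tol:
            break
    return gamma


# Hypothetical check: y set equal to its conditional mean exp(c_g + 0.5 z_gm),
# with a different c_g and a different size for each cluster.
data = []
for c_g, z in [(0.2, [0.0, 1.0, 2.0]), (-0.5, [0.5, 1.5]), (1.0, [0.0, 0.3, 0.9, 2.0])]:
    data.append((z, [math.exp(c_g + 0.5 * zi) for zi in z]))

print(fe_poisson(data))
```

Because the cluster totals are conditioned on, the $c_g$ drop out and the value $0.5$ is recovered whatever the cluster effects are; the responses here are not even integers, which underlines that only the conditional mean (20.51) needs to be correct.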


20.3.2 Cluster Samples with Unit-Specific Panel Data 


Often, cluster samples come with a time component, so that there are two potential 
sources of correlation across observations: across time within the same individual and 
across individuals within the same group. The two sources of correlation may also 
interact: different individuals within the same group or cluster might have unobserved 
shocks correlated across different time periods. 

Generally, accounting for more than two data dimensions is complicated if there is 
not a natural nesting. Here we consider the case where each unit belongs to a cluster 
and the cluster identification does not change over time. In other words, we have 
panel data on each individual or unit, and each unit belongs to a cluster. For exam- 
ple, we might have annual panel data at the firm level where each firm belongs to the 
same industry (cluster) for all years. Or, we might have panel data for schools that 
each belong to a district. This is a special case of a hierarchical linear model (HLM)
setup. Models for data structures involving panel data and clustering are also called
mixed models (although this latter name typically refers to the situation, which we 
treat later, in which some slope parameters are constant and others are unobserved 
heterogeneity). In the HLM/mixed models literature, more levels of nesting are 
allowed, but we will not consider more general structures; see, for example, Rau- 
denbush and Bryk (2002). 

Now we have three data subscripts on at least some variables that we observe. For
example, the response variable is $y_{gmt}$, where $g$ indexes the group or cluster, $m$ is the
unit within the group, and $t$ is the time index. Mainly for expository purposes,
assume we have a balanced panel with the time periods running from $t = 1, \ldots, T$.
Within cluster $g$ there are $M_g$ units, and we have sampled $G$ clusters. (In the HLM
literature, $g$ is usually called the first level and $m$ the second level.)

As with a pure cluster sample, we assume that we have many groups, $G$, and
relatively few members of the group. Further, our discussion of asymptotic properties
of estimators assumes that $T$ is fixed. In particular, the analysis is with the $M_g$ and
$T$ fixed with $G$ getting large. For example, if we can sample, say, several hundred
school districts, with a few to maybe a few dozen schools per district, over a handful
of years, then we have a data set that can be analyzed in the current framework.


A standard linear model with constant slopes can be written, for $t = 1, \ldots, T$,
$m = 1, \ldots, M_g$, and a random draw $g$ from the population of clusters as

$$y_{gmt} = \eta_t + w_g\alpha + x_{gm}\beta + z_{gmt}\delta + h_g + c_{gm} + u_{gmt}, \tag{20.52}$$


where, say, $h_g$ is the industry or district effect, $c_{gm}$ is the firm effect or school effect
(firm or school $m$ in industry or district $g$), and $u_{gmt}$ is the idiosyncratic effect. In other
words, the composite error is


$$v_{gmt} = h_g + c_{gm} + u_{gmt}. \tag{20.53}$$


Generally, the model can include variables that change at any level. In equation
(20.52), some elements of $z_{gmt}$ might change only across $g$ and $t$, and not by unit. This
is an important special case for policy analysis where the policy applies at the group
level and changes over time. In such cases it is crucial for obtaining correct inference
to recognize the cluster correlation. In effect, if one has observables in the model
measured at the group level (whether or not they change over time), it is effectively
cheating to then assume there are no group-level unobservables affecting $y_{gmt}$. This
could be the case, but one should not assume it from the outset.

A simple estimation method, assuming $v_{gmt}$ is uncorrelated with $(w_g, x_{gm}, z_{gmt})$, is
pooled OLS, which is consistent as $G \to \infty$ for any cluster or serial correlation
pattern. The most general inference for pooled OLS, maintaining independence across
clusters, is to allow any kind of serial correlation across units or time, or both,
within a cluster.

Not surprisingly, one can apply a generalized least squares analysis that makes
assumptions about the components of the composite error. Typically, it is assumed
that the components are pairwise uncorrelated, the $c_{gm}$ are uncorrelated within
cluster (with common variance), and the $u_{gmt}$ are uncorrelated within cluster and across
time (with common variance). The resulting feasible GLS estimator is an extension of
the usual random effects estimator for panel data. Because of the large-$G$ setting, the
estimator is consistent and asymptotically normal whether or not the actual variance
structure we use in estimation is the proper one. To guard against heteroskedasticity
in any of the errors and serial correlation in the $\{u_{gmt}\}$, one should use fully robust
inference that does not rely on the form of the unconditional variance matrix (which
may also differ from the conditional variance matrix).
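Under the assumptions just listed, each element of the variance matrix of the composite error (20.53) is $\sigma_h^2 + 1[m = p]\,\sigma_c^2 + 1[m = p,\, t = s]\,\sigma_u^2$. The sketch below assembles this matrix for one cluster, stacking the $T$ observations of each unit next to each other; the function name and the variance values are hypothetical.

```python
def nested_covariance(m_g, n_periods, sigma_h2, sigma_c2, sigma_u2):
    """Var(v_g) when h_g, c_gm, and u_gmt are mutually uncorrelated with
    constant variances. Observation i belongs to unit i // n_periods."""
    size = m_g * n_periods
    omega = [[0.0] * size for _ in range(size)]
    for i in range(size):
        for j in range(size):
            value = sigma_h2                        # shared cluster effect
            if i // n_periods == j // n_periods:
                value += sigma_c2                   # same unit: shared unit effect
            if i == j:
                value += sigma_u2                   # same unit and period
            omega[i][j] = value
    return omega


# Hypothetical cluster with two units observed for two periods.
omega = nested_covariance(2, 2, sigma_h2=1.0, sigma_c2=2.0, sigma_u2=4.0)
for row in omega:
    print(row)
```

The resulting matrix has three distinct values: the variance on the diagonal, a larger covariance between periods of the same unit, and a smaller covariance between different units of the same cluster. Any richer pattern, such as serial correlation in $u_{gmt}$, falls outside this form and is the reason for robust inference.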

Simple strategies are available, too. For example, one can apply random effects 
at the individual level, effectively ignoring the clusters in estimation. In other words, 
treat the data as a standard panel data set in estimation. Such an estimator might be 
more efficient than pooled OLS yet easier to obtain than a complete GLS analysis 
that also accounts for the cluster sampling. To account for the cluster sampling in 
inference, one computes a fully robust variance matrix estimator for the usual ran- 
dom effects estimator. Many statistical packages have options to allow for clustering 
at a higher level of aggregation than the level at which random effects is applied. 

More formally, write the equation for each cluster as 


Ya = RO + vy, (20.54) 


where a row of R, is (1,d2,...,dT, Wg, Xgm,Zgmr) (which includes a full set of period 
dummies) and @ is the vector of all regression parameters. For cluster g, y, contains 
M,T elements (T periods for each unit m). In particular, 


Ygl Vgmi 
Yg2 Vgm2 

Yo=] - [> Yom=] . p (20.55) 
Yg, My YgmT 


so that each y,,, is Tx l; vọ has an identical structure. Now, we can obtain 
Q, = Var(v,) under various assumptions and apply feasible GLS. 

Random effects estimation at the unit level is obtained by choosing $\Omega_g = I_{M_g} \otimes \Lambda$,
where $\Lambda$ is the $T \times T$ matrix with the RE structure. Of course, if there is within-cluster
correlation, this is not the correct form of $\mathrm{Var}(v_g)$, and that is why robust
inference generally is needed after RE estimation. Generally, to allow for an incorrect
structure imposed on $\Omega_g$, or to allow for system heteroskedasticity, that is,
$\mathrm{Var}(v_g \mid R_g) \neq \mathrm{Var}(v_g)$, we use fully robust inference. In particular, the robust
asymptotic variance of $\hat{\theta}$ is estimated as


$$ \widehat{\mathrm{Avar}}(\hat{\theta}) =
   \left( \sum_{g=1}^{G} R_g' \hat{\Omega}_g^{-1} R_g \right)^{-1}
   \left( \sum_{g=1}^{G} R_g' \hat{\Omega}_g^{-1} \hat{v}_g \hat{v}_g' \hat{\Omega}_g^{-1} R_g \right)
   \left( \sum_{g=1}^{G} R_g' \hat{\Omega}_g^{-1} R_g \right)^{-1}, \tag{20.56} $$

where $\hat{v}_g = y_g - R_g \hat{\theta}$. Some software packages that allow cluster-robust inference
after panel data estimation compute this fully robust asymptotic variance. Unfortunately,
routines intended for estimating HLMs (or mixed models) often assume that
the structure imposed on $\Omega_g$ is correct, and that $\mathrm{Var}(v_g \mid R_g) = \mathrm{Var}(v_g)$. The resulting
inference could be misleading, especially if serial correlation in $\{u_{gmt}\}$ is not allowed.
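
The sandwich form in equation (20.56) can be sketched in a few lines of NumPy. This is a minimal illustration, not a reference implementation: the function name `robust_avar`, the toy data, and the choice of an identity working matrix (which makes the point estimator pooled OLS) are all assumptions made for the example.

```python
import numpy as np

def robust_avar(R_list, v_list, omega_list):
    """Sandwich variance in the form of equation (20.56).

    R_list[g]     -- (n_g, P) regressor matrix R_g for cluster g
    v_list[g]     -- (n_g,) residual vector y_g - R_g theta_hat
    omega_list[g] -- (n_g, n_g) symmetric working variance matrix for cluster g
    """
    P = R_list[0].shape[1]
    bread = np.zeros((P, P))   # sum_g R_g' Omega_g^{-1} R_g
    meat = np.zeros((P, P))    # sum_g R_g' Omega_g^{-1} v_g v_g' Omega_g^{-1} R_g
    for R, v, omega in zip(R_list, v_list, omega_list):
        weighted_R = np.linalg.solve(omega, R)   # Omega_g^{-1} R_g
        bread += weighted_R.T @ R
        score = weighted_R.T @ v                 # R_g' Omega_g^{-1} v_g
        meat += np.outer(score, score)
    bread_inv = np.linalg.inv(bread)
    return bread_inv @ meat @ bread_inv

# Hypothetical toy data: G = 40 clusters, 6 rows each, intercept plus one regressor.
rng = np.random.default_rng(0)
G, n_g = 40, 6
R_list, y_list = [], []
for _ in range(G):
    R = np.column_stack([np.ones(n_g), rng.normal(size=n_g)])
    cluster_effect = rng.normal()   # induces within-cluster correlation
    y = R @ np.array([1.0, 2.0]) + cluster_effect + rng.normal(size=n_g)
    R_list.append(R)
    y_list.append(y)

# Working structure: identity, so the point estimator is pooled OLS.
omega_list = [np.eye(n_g) for _ in range(G)]
R_all, y_all = np.vstack(R_list), np.concatenate(y_list)
theta_hat = np.linalg.lstsq(R_all, y_all, rcond=None)[0]
v_list = [y - R @ theta_hat for R, y in zip(R_list, y_list)]

avar = robust_avar(R_list, v_list, omega_list)
robust_se = np.sqrt(np.diag(avar))
print(theta_hat, robust_se)
```

Replacing the identity matrices with an estimated random effects structure would give the robust variance for the corresponding feasible GLS estimator; the residuals would then come from that estimator rather than from pooled OLS.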

Because of the nested data structure, we have available different versions of fixed
effects estimators. Subtracting cluster averages from all observations within a cluster
eliminates $h_g$; when $w_{gt} = w_g$ for all $t$, $w_g$ is also eliminated. But the unit-specific
effects, $c_{gm}$, are still part of the error term. If we are mainly interested in $\delta$, the
coefficients on the time-varying variables $z_{gmt}$, then removing $c_{gm}$ (along with $h_g$) is
attractive. In other words, use a standard fixed effects analysis at the individual level.
(If the units were allowed to change groups over time, then we would replace $h_g$ with
$h_{gt}$, and then subtracting off individual-specific means would not remove the
time-varying cluster effects.)


Example 20.4 (Effects of Spending on School Performance): The data in 
MEAP94_98, which are a subset of those used in Papke (2005), contain school-level 
panel data on student performance and per-pupil spending. The variable to be 
explained, math4, is the percentage of students receiving a satisfactory score on a 
fourth-grade math test administered by the state of Michigan. The key variable, 
lavgrexp = log(avgrexp), is the log of average real per-pupil spending for the current 
and previous year. The data set is for the years 1994 through 1998; it is unbalanced, 
with schools having either three, four, or five years of data (in various patterns). The 
other school-level controls are enrollment, in logarithmic form (lenroll), and the
percentage of students eligible for the free lunch program (lunch). A full set of year
dummies is also included.

We can view this as a cluster sample because schools are nested within districts. 
Certainly much of the variability in spending is across districts, and so it may be 
important to allow for district-level effects. 

The results of fixed effects estimation, at the school level, are given in Table 20.2.
Because schools are in the same district in every year, eliminating a school effect also
removes any additive district effect. But within-district correlation can be present if
some of the slopes change by district. Along with the FE estimates, three standard
errors are provided: the usual FE standard errors that ignore serial correlation and
within-district correlation; the standard errors that are robust to arbitrary serial
correlation within school but assume no correlation across schools within a district; and
the most robust standard errors that allow within-district correlation across schools
and time periods. (Remember that we are assuming independence across districts;
without this assumption, proper inference becomes much more difficult.)

Table 20.2
Fixed Effects Estimation of Spending on Test Pass Rates
Dependent Variable: math4

                        FE            Usual FE         S.E. Clustered   S.E. Clustered
Explanatory Variable    Coefficient   Standard Error   by School        by District
log(avgrexp)              6.29          2.10             2.43             3.13
lunch                    -0.022         0.031            0.039            0.040
log(enroll)              -2.04          1.79             1.79             2.10
y95                      11.62          0.55             0.54             0.72
y96                      13.06          0.66             0.69             0.93
y97                      10.15          0.70             0.73             0.96
y98                      23.41          0.72             0.77             1.03

Number of districts: 467
Number of schools: 1,683

The FE estimate of $\beta_{lavgrexp}$ is about 6.29, which means that a 10% increase in
average real spending is estimated to increase the pass rate by about 0.63 percentage
points. Using the usual FE standard error, about 2.10, the $t$ statistic for $\hat{\beta}_{lavgrexp}$ is
about 3.0. Therefore, the effect of spending is statistically significant at a low
significance level (about 0.3%) using the usual FE inference. The standard error that allows
arbitrary serial correlation within schools (and heteroskedasticity, too) is higher,
about 2.43. Naturally, this reduces the statistical significance of $\hat{\beta}_{lavgrexp}$. The standard
error in column (4) is substantially higher, about 3.13. Allowing for within-district
correlation and serial correlation has practically important effects on the uncertainty
associated with the estimate. Using the fully robust standard error, the 95%
confidence interval for $\beta_{lavgrexp}$ excludes zero, but only barely.
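
The arithmetic behind these statements can be checked directly from the numbers in Table 20.2. The short sketch below uses only the Python standard library and the standard normal approximation for p-values and confidence intervals; the variable names are illustrative.

```python
from statistics import NormalDist

beta_hat = 6.29  # FE coefficient on log(avgrexp), Table 20.2
std_errors = {
    "usual FE": 2.10,
    "clustered by school": 2.43,
    "clustered by district": 3.13,
}

std_normal = NormalDist()
z_crit = std_normal.inv_cdf(0.975)  # two-sided 5% critical value

results = {}
for label, se in std_errors.items():
    t_stat = beta_hat / se
    p_value = 2 * (1 - std_normal.cdf(abs(t_stat)))
    ci = (beta_hat - z_crit * se, beta_hat + z_crit * se)
    results[label] = (t_stat, p_value, ci)
    print(f"{label:22s} t = {t_stat:.2f}  p = {p_value:.4f}  "
          f"CI = ({ci[0]:.2f}, {ci[1]:.2f})")

# Level-log model: a 10% spending increase changes math4 by roughly beta * 0.10 points.
effect_10pct = beta_hat * 0.10
```

With the district-clustered standard error, the lower end of the interval sits just above zero, which is the sense in which the interval excludes zero "only barely."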


If the model is given by equations (20.52) and (20.53), the unit-specific time
demeaning eliminates all cluster correlation, and the inference need only be made
robust to neglected serial correlation in $\{u_{gmt}\}$. But we might want to use
cluster-robust inference anyway to allow for more general situations. Suppose the model is

$$ \begin{aligned}
y_{gmt} &= \eta_t + w_g \alpha + x_{gm} \beta + z_{gmt} d_{gm} + h_g + c_{gm} + u_{gmt} \\
        &= \eta_t + w_g \alpha + x_{gm} \beta + z_{gmt} \delta + h_g + c_{gm} + u_{gmt} + z_{gmt} e_{gm},
\end{aligned} \tag{20.57} $$


where $d_{gm} = \delta + e_{gm}$ is a set of unit-specific slopes on the individual, time-varying
covariates $z_{gmt}$. The time-demeaned equation within individual $m$ in cluster $g$ is


$$ y_{gmt} - \bar{y}_{gm} = (\eta_t - \bar{\eta}) + (z_{gmt} - \bar{z}_{gm}) \delta + (u_{gmt} - \bar{u}_{gm})
   + (z_{gmt} - \bar{z}_{gm}) e_{gm}. \tag{20.58} $$


Because the $e_{gm}$ are generally correlated across units within cluster $g$, the last term
generally induces cluster correlation of a heteroskedastic nature within cluster $g$.
From our discussion in Section 11.7.3, we know that FE is still consistent if
$E(d_{gm} \mid z_{gmt} - \bar{z}_{gm}) = E(d_{gm})$, $m = 1, \ldots, M_g$, $t = 1, \ldots, T$, and all $g$, and so
cluster-robust inference, which is automatically robust to serial correlation and
heteroskedasticity, makes perfectly good sense.

An important feature of the HLM approach is the possibility of allowing the slopes
to depend on observed covariates. Often one begins with a model at the unit/time-period
level that contains heterogeneity, and then allows the intercept and slopes to
depend on higher-level covariates. Write a model for unit $m$ at time $t$ in cluster $g$ as


$$ y_{gmt} = z_{gmt} d_{gm} + v_{gmt}, \tag{20.59} $$

and then decompose the idiosyncratic error, $v_{gmt}$, as

$$ v_{gmt} = \eta_t + c_{gm} + u_{gmt}, \tag{20.60} $$


where the $\eta_t$ are aggregate time effects. For notational simplicity, we absorb the
group effect, $h_g$, into $u_{gmt}$, and allow $c_{gm}$ and $u_{gmt}$ to be correlated within group. For
each $(g, m)$ define


$$ r_{gm} = (w_g, \bar{x}_g, x_{gm}, \bar{z}_{gm}), $$


where $\bar{x}_g = M_g^{-1} \sum_{p=1}^{M_g} x_{gp}$ and $\bar{z}_{gm} = T^{-1} \sum_{s=1}^{T} z_{gms}$. In other words, $r_{gm}$ includes the
group-level covariates along with group averages of the unit-specific covariates, the
unit-specific covariates, and the time averages of the covariates that change over
time. Now assume


$$ c_{gm} = \alpha + r_{gm} \gamma + a_{gm} \tag{20.61} $$


$$ d_{gm} = \delta + \Pi (r_{gm} - \mu_r)' + e_{gm}; \tag{20.62} $$

insert these in the equation, and use basic algebra:


$$ y_{gmt} = \eta_t + \alpha + r_{gm} \gamma + z_{gmt} \delta + [(r_{gm} - \mu_r) \otimes z_{gmt}] \pi
   + a_{gm} + z_{gmt} e_{gm} + u_{gmt}, $$

where $\pi = \mathrm{vec}(\Pi)$. Importantly, centering $r_{gm}$ about its average before forming the
interactions means that $\delta$ is the average partial effect. If we instead use $r_{gm} \otimes z_{gmt}$, the
coefficients on the level terms, $z_{gmt}$, may be of little interest because they measure
the effects of the $z_{gmt}$ when $r_{gm}$ is zero, which is unlikely to be an interesting segment
of the population. In practice, the population mean, $\mu_r$, is replaced with the sample
average across $m$ and $g$. The presence of $z_{gmt} e_{gm}$ in the error term, as well as potential
serial correlation in $\{u_{gmt}\}$, makes a genuine GLS analysis difficult but possible under
simple structures for the variance-covariance matrices. But we can use any of the
simpler strategies mentioned earlier. For example, we can act as if $e_{gm} = 0$ and as if
$u_{gmt}$ is serially uncorrelated in estimation. We can apply random effects to account
for a cluster-level effect or RE at the individual level, or both. Basically, we include
cluster-level variables, averages of unit-specific, time-constant variables, and time
averages of the variables that change over time along with the unit-specific variables.
For added flexibility, we include a full set of interactions. Regardless of the specifics,
we use fully robust inference.
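
Two pieces of the "basic algebra" can be verified numerically: that centering makes the average slope equal to $\delta$, and that $z_{gmt} \Pi (r_{gm} - \mu_r)'$ can be written with a Kronecker product and $\mathrm{vec}(\Pi)$. The sketch below is illustrative only; the dimensions, parameter values, and simulated covariates are hypothetical, $e_{gm}$ is set to zero, and the sample average stands in for $\mu_r$.

```python
import numpy as np

rng = np.random.default_rng(1)
n_units, L, R_dim = 200, 2, 3            # hypothetical sizes: units, dim(z), dim(r)
delta = np.array([1.0, -0.5])            # average slopes
Pi = rng.normal(size=(L, R_dim))         # how the slopes shift with r
r = rng.normal(loc=2.0, size=(n_units, R_dim))
r_bar = r.mean(axis=0)                   # sample stand-in for mu_r

# Unit-specific slopes d = delta + Pi (r - r_bar)', taking e_gm = 0 for clarity.
d = delta + (r - r_bar) @ Pi.T
avg_slope = d.mean(axis=0)               # centered deviations average to zero

# Same slopes written with uncentered r: the level coefficient is the slope at r = 0.
level_coef_uncentered = delta - Pi @ r_bar

# Interaction term for one unit and period:
# z Pi (r - r_bar)' = [(r - r_bar) kron z] vec(Pi), with vec stacking columns.
z = rng.normal(size=L)
lhs = z @ Pi @ (r[0] - r_bar)
rhs = np.kron(r[0] - r_bar, z) @ Pi.flatten(order="F")

print(avg_slope, level_coef_uncentered, lhs, rhs)
```

The uncentered level coefficient differs from `delta` whenever `Pi @ r_bar` is nonzero, which is why the coefficients on the level terms are hard to interpret without centering.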

A very similar discussion holds in the context of instrumental variables. Suppose
we start with the model


$$ y_{gmt} = \eta_t + r_{gmt} \theta + v_{gmt}, \tag{20.63} $$


where $r_{gmt}$ contains all covariates and $v_{gmt}$ is the composite error. If we have
exogenous variables, say $q_{gmt}$, such that $E(q_{gmt}' v_{gmt}) = 0$ and the rank condition holds, then
pooled 2SLS is attractive for its simplicity. It does not matter whether
$r_{gmt}$ or $q_{gmt}$ contain elements that change only across $g$, across $g$ and $m$, across $g$ and
$t$, or across $g$, $m$, and $t$, provided the rank condition holds. Without further
assumptions, the 2SLS variance matrix estimator, as well as inference generally, should be
robust to arbitrary serial correlation and cluster correlation at the most aggregated
level. For example, if $g$ indexes counties and $m$ indexes manufacturing plants
operating within a county, then we should cluster at the county level. We may have a
policy and instruments that change only at the county level over time, along with
exogenous explanatory variables that change at the plant level (either constant or
over time). In evaluating whether the rank condition holds (say, for a single
endogenous variable $w_{gmt}$) one can use a pooled OLS regression of $w_{gmt}$ on $1, d2_t, \ldots, dT_t,
q_{gmt}$ (assuming that $q_{gmt}$ contains all exogenous variables in equation (20.63)) to test
for joint significance of the proposed instruments in $q_{gmt}$. Naturally, such a test
should be made robust to arbitrary cluster and serial correlation to be convincing.
The test works even if $w_{gmt}$ does not change across $m$ (or even $t$ for that matter), and
the same with $q_{gmt}$. The inference is valid with large $G$ provided it is made fully
robust.


In the previous scenario, if we apply, say, fixed effects 2SLS, where we eliminate a
time-constant, plant-level effect, then we need the variables of interest to at least
change over time (if not across $m$); the same is true of the instruments. If we have
instruments that change only by $g$, the FE2SLS estimator, whether we remove a
county-level or plant-level effect, does not identify $\theta$.


20.3.3 Should We Apply Cluster-Robust Inference with Large Group Sizes? 


Until recently, the “cluster-robust” standard errors and test statistics obtained from
pooled OLS, random effects, and fixed effects were known to be valid only as $G \to \infty$
with each $M_g$ fixed. As a practical matter, that fact means that one should have lots
of small groups. Recently, because of the structure of many commonly used cluster
samples, researchers have become interested in the performance of cluster-robust
inference when the number of groups, $G$, is not substantially larger than the typical
group size, $M_g$.

Consider the basic model without a time structure, for simplicity, and consider
formula (20.25), the asymptotic variance for pooled OLS. With a large number of
groups and small group sizes, we can get good estimates of the within-cluster
correlations (technically, of the cluster correlations of the cross products of the
regressors and errors) even if they are unrestricted, and it is for that reason that the
robust variance matrix is consistent as $G \to \infty$ with $M_g$ fixed. In fact, in this scenario,
one loses nothing in terms of asymptotic local power (with local alternatives shrinking
to zero at the rate $G^{-1/2}$) if $c_g$ is not present. In other words, based on first-order
asymptotic analysis, there is no cost to being fully robust to any kind of within-group
correlation or heteroskedasticity. These arguments apply equally to panel data sets
with a large number of cross sections and relatively few time periods, whether or not
the idiosyncratic errors are serially correlated, and to the cluster sample/panel data
setting considered in Section 20.3.2.

What if one applies robust inference in scenarios where the fixed $M_g$, $G \to \infty$
asymptotic analysis is not realistic? Hansen (2007) has recently derived properties of
the cluster-robust variance matrix and related test statistics under various scenarios
that help us more fully understand the properties of cluster-robust inference across
different data configurations. Hansen (2007, Theorem 2) shows that, with $G$ and $M_g$
both getting large, the usual inference based on equation (20.25) is valid with
arbitrary correlation among the errors, $v_{gm}$, within each group. Because we usually think
of $v_{gm}$ as including the group effect $c_g$, this means that, with large group sizes, we can
obtain valid inference using the cluster-robust variance matrix, provided that $G$ is
also large. So, for example, if we have a sample of $G = 100$ schools and roughly
$M_g = 100$ students per school, and we use pooled OLS leaving the school effects
in the error term, we should expect the inference to have roughly the correct size.
Probably we leave the school effects in the error term because we are interested in a
school-specific explanatory variable, perhaps indicating a policy change. Adding a
short time dimension does not change these conclusions.

Unfortunately, pooled OLS with cluster effects when $G$ is small and group sizes are
large falls outside Hansen’s theoretical findings: the proper asymptotic analysis
would be with $G$ fixed, $M_g \to \infty$, and persistent within-cluster correlation (because
of the presence of $c_g$ in the error) causes problems in such cases. Consequently, we
should not expect good properties of the cluster-robust inference with a small number
of groups and very large group sizes when cluster effects are left in the error term. As an
example, suppose that $G = 10$ hospitals have been sampled with several hundred
patients per hospital. If the explanatory variable of interest is exogenous and varies
only at the hospital level, it is tempting to use pooled OLS with cluster-robust
inference. But we have no theoretical justification for doing so, and we have reasons to
expect it will not work well, including the simulations in Hansen (2007).

If the explanatory variables of interest vary within group (say, within each
hospital a subset of patients were provided with a specific kind of care), fixed effects is
attractive for a couple of reasons. The first advantage is the usual one about allowing
$c_g$ to be arbitrarily correlated with the $z_{gm}$. The second advantage is that, with
large $M_g$, we can treat the $c_g$ as parameters to estimate (because we can estimate
them precisely) and then assume that the observations are independent across $m$
(as well as $g$). Therefore, the usual inference is valid, perhaps with adjustments for
heteroskedasticity.

In summary, for true cluster sample applications, cluster-robust inference using
pooled OLS delivers statistics with proper size when $G$ and $M_g$ are both moderately
large, but it should probably be avoided with large $M_g$ and small $G$. We will
discuss some approaches for handling a small number of groups in Section 20.3.4.


20.3.4 Inference When the Number of Clusters Is Small 


If the explanatory variable or variables of interest do not change within cluster and
the number of clusters is small, none of the previous methods can be used for reliable
inference. Fixed effects eliminates the key variables, while for pooled OLS we are not
justified in using cluster-robust inference. (Whether a random effects analysis
produces valid inference with small $G$ and large $M_g$ appears to be an open, and very
interesting, question.)

The problem of proper inference when $M_g$ is large relative to $G$ was brought to
light by Moulton (1990), who was interested in studying data on individuals clustered
at the state level in the United States. He proposed corrections to the usual OLS
standard errors that impose more structure than the usual cluster-robust standard
errors studied by Hansen (2007). Either way, the corrections to the usual OLS
inference tend to work well provided the $M_g$ are not too much bigger than $G$. In this
subsection we are interested in cases where a large $G$ analysis makes no sense.

Often with small $G$ and large $M_g$ the sampling scheme more resembles that of
standard stratified sampling, but without requiring a complete partition of the
population. In other words, a small set of populations are defined, and then random
samples are obtained from those populations. As an example, a random sample of 
adults is obtained from each of a handful of cities, some of which received federal aid 
for a job-training program. Labor market outcomes are recorded, possibly including 
changes from an early time period. In this scenario, we could analyze the data as 
independent outcomes across and, more importantly, within group. We will return to 
this point. 

Recent work by Donald and Lang (2007) (hereafter, DL) treats the small G case 
within the context of cluster sampling. That is, presumably from a large population 
of clusters, only a handful or so are drawn (and then we may or may not sample 
every unit within each cluster). As mentioned in the previous subsection, such a sce- 
nario causes problems for cluster-robust inference. Therefore, DL propose a different 
approach. 

Before we cover the DL approach, it is important to understand that the structure 
of data sets in the small G case is the same whether we think of drawing a small 
number of clusters from a large population or fixing a few clusters and then drawing 
large random samples from them. Unfortunately, how one proceeds is dependent on 
how we view the sampling scheme. As we will see, the DL approach is typically much 
more conservative than the standard approach. 

To illustrate the issues considered by DL, consider the simplest case, with a single 
regressor that varies only by group: 


\begin{align}
y_{gm} &= \alpha + \beta x_g + c_g + u_{gm} \tag{20.64} \\
       &= \delta_g + \beta x_g + u_{gm}, \quad m = 1, \ldots, M_g, \; g = 1, \ldots, G. \tag{20.65}
\end{align}


Notice how equation (20.65) is written as a model with common slope, $\beta$, but
intercept, $\delta_g$, that varies across $g$. Donald and Lang focus on equation (20.64), where $c_g$ is
assumed to be independent of $x_g$ with zero mean. They use this formulation to
highlight the problems of applying standard inference to equation (20.64), that is, acting
as if $c_g$ is absent. We know this is a bad idea even in the large $G$, small $M_g$ case, as
it ignores the persistent correlation in the errors within each group. Unfortunately,
while Hansen (2007) has shown that cluster-robust inference is valid with large $G$,
even if the $M_g$ are also large, it is not valid when $G$ is small. Thus other approaches
are needed.

One way to see the problem in applying standard inference is to note that when
$M_g = M$ for all $g = 1, \ldots, G$, the pooled OLS estimator, $\hat{\beta}$, is identical to the
“between” estimator obtained from the regression


$$ \bar{y}_g \text{ on } 1, x_g, \quad g = 1, \ldots, G. \tag{20.66} $$


Conditional on the $x_g$, the estimator $\hat{\beta}$ inherits its distribution from
$\{\bar{v}_g : g = 1, \ldots, G\}$, the within-group averages of the composite errors $v_{gm} = c_g + u_{gm}$. The presence
of $c_g$ means new observations within group do not provide additional information for
estimating $\beta$ beyond how they affect the group average, $\bar{y}_g$. In effect, we only have $G$
useful pieces of information.
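
The equivalence between pooled OLS and the between regression under a common group size is easy to confirm on a small example. The sketch below uses made-up numbers with $G = 5$ groups of size $M = 3$ and a hypothetical helper, `ols_slope`, for the simple-regression slope.

```python
def ols_slope(x, y):
    """Slope from a simple OLS regression of y on a constant and x."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    return sxy / sxx

# Hypothetical data: the regressor varies only by group, M = 3 units per group.
x_g = [0.0, 0.0, 1.0, 1.0, 2.0]
y_by_group = [
    [1.0, 2.0, 3.0],
    [2.0, 2.5, 4.5],
    [3.0, 5.0, 4.0],
    [6.0, 4.0, 5.0],
    [7.0, 9.0, 8.0],
]

# Pooled OLS: every unit carries its group's x_g.
x_pooled = [xg for xg, ys in zip(x_g, y_by_group) for _ in ys]
y_pooled = [y for ys in y_by_group for y in ys]
beta_pooled = ols_slope(x_pooled, y_pooled)

# Between estimator: regress the G group averages on x_g.
y_bar_g = [sum(ys) / len(ys) for ys in y_by_group]
beta_between = ols_slope(x_g, y_bar_g)

print(beta_pooled, beta_between)
```

The pooled regression uses 15 observations and the between regression uses 5, yet the slope is the same; the extra observations enter only through the group averages.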

If we add some strong assumptions, there is a solution to the inference problem. In
addition to assuming $M_g = M$ for all $g$, assume $c_g \mid x_g \sim \mathrm{Normal}(0, \sigma_c^2)$ and assume
$u_{gm} \mid x_g, c_g \sim \mathrm{Normal}(0, \sigma_u^2)$. Then $\bar{v}_g$ is independent of $x_g$ and
$\bar{v}_g \sim \mathrm{Normal}(0, \sigma_c^2 + \sigma_u^2 / M)$ for all $g$. Because we assume independence across $g$, the equation


$$ \bar{y}_g = \alpha + \beta x_g + \bar{v}_g, \quad g = 1, \ldots, G \tag{20.67} $$


satisfies the classical linear model assumptions. Therefore, we can use inference based
on the $t_{G-2}$ distribution to test hypotheses about $\beta$, provided $G > 2$. When $G$ is very
small, the requirements for a significant $t$ statistic using the $t_{G-2}$ distribution are
much more stringent than if we use the $t_{M_1 + M_2 + \cdots + M_G - 2}$ distribution, which is what
we would be doing if we used the usual pooled OLS statistics.
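
To see how stringent the $t_{G-2}$ requirement is, the sketch below compares two-sided 5% critical values for $G = 3$ and $G = 4$ with the standard normal value that the pooled OLS degrees of freedom effectively deliver when the $M_g$ are large. It relies on the closed-form quantile functions of the $t_1$ (Cauchy) and $t_2$ distributions, so it needs only the standard library; the helper name is hypothetical.

```python
import math
from statistics import NormalDist

def t_quantile_small_df(p, df):
    """Exact Student-t quantile for df = 1 or df = 2 (closed forms)."""
    if df == 1:   # t_1 is the Cauchy distribution
        return math.tan(math.pi * (p - 0.5))
    if df == 2:   # inverts F(t) = 1/2 + t / (2 * sqrt(t**2 + 2))
        return (2 * p - 1) / math.sqrt(2 * p * (1 - p))
    raise ValueError("closed form implemented only for df in {1, 2}")

p = 0.975
crit_G3 = t_quantile_small_df(p, df=1)   # G = 3 groups, so t_{G-2} = t_1
crit_G4 = t_quantile_small_df(p, df=2)   # G = 4 groups, so t_{G-2} = t_2
crit_normal = NormalDist().inv_cdf(p)    # limit of t_df as df grows

print(f"G = 3: {crit_G3:.3f}   G = 4: {crit_G4:.3f}   normal: {crit_normal:.3f}")
```

A $t$ statistic of 3, which would be strong evidence with hundreds of degrees of freedom, falls short of the 5% critical value in both small-$G$ cases.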

When $x_g$ is a $1 \times K$ vector, we need $G > K + 1$ to use the $t_{G-K-1}$ distribution for
inference. (In Moulton (1990), $G = 50$ states and $x_g$ contains 17 elements.)

As pointed out by DL, performing the correct inference in the presence of $c_g$ is not
just a matter of correcting the pooled OLS standard errors for cluster correlation
(something that does not appear to be valid for small $G$, anyway) or using the RE
estimator. In the case of common group sizes, there is only one estimator: pooled OLS,
random effects, and the between regression in equation (20.66) all lead to the same $\hat{\beta}$.
The regression in equation (20.66), by using the $t_{G-2}$ distribution, yields inference
with appropriate size.

We can apply the DL method without normality of the $u_{gm}$ if the common group
size $M$ is large: by the central limit theorem, $\bar{u}_g$ will be approximately normally
distributed very generally. Then, because $c_g$ is normally distributed, we can treat $\bar{v}_g$ as
approximately normal with constant variance. Further, even if the group sizes differ
across $g$, for very large group sizes $\bar{u}_g$ will be a negligible part of $\bar{v}_g$:
$\mathrm{Var}(\bar{v}_g) = \sigma_c^2 + \sigma_u^2 / M_g$. Provided $c_g$ is normally distributed and it dominates $\bar{v}_g$, a classical linear
model analysis on equation (20.67) should be roughly valid.

The broadest applicability of DL’s setup occurs when the average of the
idiosyncratic errors, $\bar{u}_g$, can be ignored, either because $\sigma_u^2$ is small relative to $\sigma_c^2$, $M_g$ is large,
or both. In fact, applying DL with different group sizes or nonnormality of the $u_{gm}$ is
identical to ignoring the estimation error in the sample averages, $\bar{y}_g$. In other words,
it is as if we are analyzing the simple regression $\mu_g = \alpha + \beta x_g + c_g$ using the classical
linear model assumptions (where we then insert $\bar{y}_g$ in place of the unknown group
mean, $\mu_g$, and ignore the estimation error). With small $G$, we need to further assume
that $c_g$ is normally distributed.

If $z_{gm}$ appears in the model, then we can use the averaged equation


$$ \bar{y}_g = \alpha + x_g \beta + \bar{z}_g \gamma + \bar{v}_g, \quad g = 1, \ldots, G, \tag{20.68} $$


provided $G > K + L + 1$. If $c_g$ is independent of $(x_g, \bar{z}_g)$ with a homoskedastic
normal distribution and the group sizes are large, inference can be carried out using the
$t_{G-K-L-1}$ distribution.

The DL solution to the inference problem with small G is pretty common as a 
strategy to check robustness of results obtained from cluster samples, but often it is 
implemented with somewhat large G (say, G = 50). Often with cluster samples one 
estimates the parameters using the disaggregated data and also the averaged data. 
When some covariates vary within cluster, using averaged data is generally ineffi- 
cient, but when estimating equation (20.68) we need not make standard errors robust 
to within-cluster correlation. We now know that if G is reasonably large and the 
group sizes not too large, the cluster-robust inference applied to the disaggregated 
data can be acceptable. As pointed out by DL, with small G one should use the group 
averages in a classical linear model analysis. 

For small $G$ and large $M_g$, inference obtained analyzing equation (20.67) as a
classical linear model will be very conservative in the absence of a cluster effect. Thus
the DL approach can be used in situations where one requires very strong statistical
evidence for the effect of a policy. Nevertheless, the DL approach rules out some
widely used staples of policy analysis. For example, suppose we have two populations
(maybe men and women, two different cities, or a treatment and a control group)
with means $\mu_g$, $g = 1, 2$, and we would like to obtain a confidence interval for their
difference. In almost all cases, it makes sense to view the data as being two random
samples, one from each subgroup of the population. Under random sampling from
each group, and assuming normality and equal population variances, the usual
comparison-of-means statistic is distributed exactly as $t_{M_1 + M_2 - 2}$ under the null
hypothesis of equal population means. (Or, we can construct an exact 95%
confidence interval of the difference in population means.) With even moderate sizes for
$M_1$ and $M_2$, the $t_{M_1 + M_2 - 2}$ distribution is close to the standard normal distribution.
Also, we can relax normality to obtain approximately valid inference, and it is easy
to adjust the $t$ statistic to allow for different population variances. With a controlled
experiment, the standard difference-in-means analysis is often quite convincing. Yet
we cannot even study this estimator in the DL setup because $G = 2$.
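
For reference, the usual equal-variance comparison-of-means statistic and its $M_1 + M_2 - 2$ degrees of freedom can be computed as in the following sketch. The two samples are invented for illustration, and `pooled_t` is a hypothetical helper name.

```python
import math
from statistics import mean, variance

def pooled_t(sample1, sample2):
    """Equal-variance two-sample t statistic and its degrees of freedom."""
    m1, m2 = len(sample1), len(sample2)
    df = m1 + m2 - 2
    # Pooled estimate of the common population variance.
    pooled_var = ((m1 - 1) * variance(sample1) + (m2 - 1) * variance(sample2)) / df
    se = math.sqrt(pooled_var * (1 / m1 + 1 / m2))
    return (mean(sample2) - mean(sample1)) / se, df

# Hypothetical outcomes for a control group and a treated group.
control = [1.0, 2.0, 3.0, 4.0, 5.0]
treated = [2.0, 4.0, 6.0, 8.0, 10.0]
t_stat, df = pooled_t(control, treated)
print(t_stat, df)
```

Here the degrees of freedom grow with the number of sampled units, not with the number of groups, which is exactly the feature the DL setup does not allow when $G = 2$.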

Donald and Lang criticize Card and Krueger (1994) for comparing mean wage
changes of workers at a sample of fast-food restaurants across two states because
Card and Krueger fail to account for the state effect (New Jersey or Pennsylvania),
$c_g$, in the composite error, $v_{gm}$. It is important to remember that the DL criticism of
the standard difference-in-differences estimator has nothing to do with whether the
increase in the minimum wage in New Jersey (in April 1992) was an exogenous event:
DL’s framework assumes that $x_g$, which is an indicator for whether a fast-food
restaurant is in New Jersey, is independent of the state effect, $c_g$. Rather, DL’s criticism
only concerns inference. (Card and Krueger find a positive, not a negative,
employment effect of increasing the minimum wage, so having a confidence interval seems to
be less important in this particular case.)

To further study the $G = 2$ case with a binary policy indicator, write the difference
in means as


$$ \mu_2 - \mu_1 = (\delta_2 + \beta) - \delta_1 = (\alpha + c_2 + \beta) - (\alpha + c_1) = \beta + (c_2 - c_1). \tag{20.69} $$


Under the DL assumptions, $c_2 - c_1$ has mean zero, and so including it as part of the
estimate, which is $\bar{y}_2 - \bar{y}_1$, does not result in bias. The authors work under the
assumption that $\beta$ is the parameter of interest, but, if the experiment is properly
randomized (as is maintained by DL), it is harmless to include the $c_g$ in the
estimated effect, in which case the standard comparison-of-means methodology, using
large $M_g$ asymptotics, is appropriate.

Consider now a case where the DL approach to inference can be applied. Assume
$G = 4$ with groups 1 and 2 the control groups ($x_1 = x_2 = 0$) and groups 3 and 4 the
treatment groups ($x_3 = x_4 = 1$). The DL approach would involve computing the
averages for each group, $\bar{y}_g$, and running the regression $\bar{y}_g$ on $1, x_g$, $g = 1, \ldots, 4$.
Inference is based on the $t_2$ distribution. The estimator $\hat{\beta}$ in this case can be written as

$$ \hat{\beta} = (\bar{y}_3 + \bar{y}_4)/2 - (\bar{y}_1 + \bar{y}_2)/2. \tag{20.70} $$


(The pooled OLS regression using the disaggregated data results in the weighted
average $(p_3 \bar{y}_3 + p_4 \bar{y}_4) - (p_1 \bar{y}_1 + p_2 \bar{y}_2)$, where $p_1 = M_1/(M_1 + M_2)$, $p_2 = M_2/(M_1 + M_2)$,
$p_3 = M_3/(M_3 + M_4)$, and $p_4 = M_4/(M_3 + M_4)$ are the relative
proportions within the control and treatment groups, respectively.) With $\hat{\beta}$ written as
in equation (20.70), we are left to wonder why we need to use the $t_2$ distribution
for, say, constructing a confidence interval. Each $\bar{y}_g$ is usually obtained from a large
sample ($M_g = 30$ or so is usually sufficient for approximate normality of the
standardized mean), and so $\hat{\beta}$, when properly standardized, has an approximate standard
normal distribution quite generally.
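
A minimal sketch of this large-$M_g$ view follows. It computes the estimator in equation (20.70), the pooled OLS weighted version, and a normal-based confidence interval. The group summaries are hypothetical, and the variance formula assumes independent random samples across the four groups, so that $\mathrm{Var}(\hat{\beta}) = \tfrac{1}{4} \sum_g \sigma_g^2 / M_g$.

```python
import math
from statistics import NormalDist

# Hypothetical group summaries: (sample mean, sample variance, group size).
groups = {
    1: (10.0, 4.0, 400),   # control
    2: (11.0, 5.0, 600),   # control
    3: (12.5, 4.5, 500),   # treatment
    4: (13.0, 6.0, 300),   # treatment
}
y_bar = {g: s[0] for g, s in groups.items()}
M = {g: s[2] for g, s in groups.items()}
var_of_mean = {g: s[1] / s[2] for g, s in groups.items()}   # sigma_g^2 / M_g

# Equation (20.70): unweighted difference of average group means.
beta_hat = (y_bar[3] + y_bar[4]) / 2 - (y_bar[1] + y_bar[2]) / 2

# Pooled OLS on the disaggregated data weights each mean by its sample share.
p1, p2 = M[1] / (M[1] + M[2]), M[2] / (M[1] + M[2])
p3, p4 = M[3] / (M[3] + M[4]), M[4] / (M[3] + M[4])
beta_pooled = (p3 * y_bar[3] + p4 * y_bar[4]) - (p1 * y_bar[1] + p2 * y_bar[2])

# Independent samples: Var(beta_hat) = (1/4) * sum over g of sigma_g^2 / M_g.
se = math.sqrt(sum(var_of_mean.values()) / 4)
z_crit = NormalDist().inv_cdf(0.975)
ci = (beta_hat - z_crit * se, beta_hat + z_crit * se)

print(beta_hat, beta_pooled, se, ci)
```

With hundreds of observations per group the interval is tight, in sharp contrast to an interval built from the $t_2$ distribution and only four group averages.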

In effect, the DL approach rejects the usual large-sample confidence interval based
on group means because it may not be the case that $\mu_1 = \mu_2$ and $\mu_3 = \mu_4$. In other
words, the control groups may be heterogeneous, as might be the treatment groups.
This possibility in itself does not invalidate standard inference applied to equation
(20.70). In fact, if we define the object of interest as


$$ \tau = (\mu_3 + \mu_4)/2 - (\mu_1 + \mu_2)/2, \tag{20.71} $$


which is an average treatment effect of sorts, then $\hat{\beta}$ is consistent for $\tau$ and (when
properly scaled) asymptotically normal as the $M_g$ get large.

The previous example suggests a different way to view the small $G$, large $M_g$ setup.
In this particular setup, we are estimating two parameters, $\alpha$ and $\beta$, given four
moments that we can estimate with the data. The OLS estimates from $\bar{y}_g$ on $1, x_g$,
$g = 1, \ldots, G$, are minimum distance (MD) estimates that impose the restrictions
$\mu_1 = \mu_2 = \alpha$ and $\mu_3 = \mu_4 = \alpha + \beta$. In particular, using the $4 \times 4$ identity matrix as the
weight matrix, we get $\hat{\beta}$ as in equation (20.70) and $\hat{\alpha} = (\bar{y}_1 + \bar{y}_2)/2$. Using the MD
approach, we see there are two overidentifying restrictions, which are easily tested.
But even if we reject them, it simply implies that at least one pair of means within
each of the control and treatment groups differs. If, say, we have four cities and
random samples of workers from each city, $x_g = 1$ indicates a job-training program
in two of the four cities, and $y_{gm}$ is the change in labor market income, then it may
simply be the case that the job-training program had differential effects across the
two treatment cities, or that the mean change in labor market income differed across
the two control cities, or both. Why should we reject the usual large $M_g$ inference
simply because the job-training program has heterogeneous effects?

With large group sizes, and whether or not G is especially large, we can put the 
general problem into an MD framework, as done, for example, by Loeb and Bound 
(1996), who had G = 36 cohort-division groups and many observations per group. 
For each group g, write 


$$ y_{gm} = \delta_g + z_{gm} \gamma_g + u_{gm}, \quad m = 1, \ldots, M_g, \tag{20.72} $$


where we assume random sampling within group and independent sampling across
groups. We make the standard assumptions for OLS to be consistent (as $M_g \to \infty$)
and $\sqrt{M_g}$-asymptotically normal, as in Chapter 4. The presence of group-level
variables $x_g$ in a “structural” model can be viewed as putting restrictions on the
intercepts, $\delta_g$, in the separate group models in equation (20.72). In particular,


$$ \delta_g = \alpha + x_g \beta, \quad g = 1, \ldots, G, \tag{20.73} $$

where we think of $x_g$ as fixed, observed attributes of heterogeneous groups. With $K$
attributes we must have $G \geq K + 1$ to determine $\alpha$ and $\beta$. If $M_g$ is large enough to
estimate the $\delta_g$ precisely, a simple two-step estimation strategy suggests itself. First,
obtain the $\hat{\delta}_g$, along with $\hat{\gamma}_g$, from an OLS regression within each group. If $G = K + 1$,
then, typically, we can solve for $\hat{\theta} \equiv (\hat{\alpha}, \hat{\beta}')'$ uniquely in terms of the $G \times 1$ vector
$\hat{\delta}$: $\hat{\theta} = X^{-1} \hat{\delta}$, where $X$ is the $(K + 1) \times (K + 1)$ matrix with $g$th row $(1, x_g)$. If
$G > K + 1$, then, in a second step, we can use a minimum distance approach, as described
in Section 14.6. If we use $I_G$, the $G \times G$ identity matrix, as the weighting matrix, then
the minimum distance estimator can be computed from the OLS regression


$$ \hat{\delta}_g \text{ on } 1, x_g, \quad g = 1, \ldots, G. \tag{20.74} $$


Under asymptotics such that $M_g = \rho_g M$ where $0 < \rho_g \leq 1$ and $M \to \infty$, the
minimum distance estimator $\hat{\theta}$ is consistent and $\sqrt{M}$-asymptotically normal. Still, this
particular MD estimator is asymptotically inefficient except under strong
assumptions. Because the samples are assumed to be independent, it is not appreciably more
difficult to obtain the efficient MD estimator, also called the “minimum chi-square”
estimator.

First consider the case where $z_{gm}$ does not appear in the first-stage estimation, so
that the $\hat\delta_g$ is just $\bar y_g$, the sample mean for group $g$. Let $\hat\sigma_g^2$ denote the usual sample
variance for group $g$. Because the $\bar y_g$ are independent across $g$, the efficient MD
estimator uses a diagonal weighting matrix. As a computational device, the minimum
chi-square estimator can be computed by using the weighted least squares (WLS)
version of regression (20.74), where group $g$ is weighted by $M_g/\hat\sigma_g^2$ (groups that have
more data and smaller variance receive greater weight). Conveniently, the reported $t$
statistics from the WLS regression are asymptotically standard normal as the group
sizes $M_g$ get large. (With fixed $G$, the WLS nature of the estimation is just a
computational device; the standard asymptotic analysis of the WLS estimator has $G \to \infty$.)
The minimum distance approach works with small $G$ provided $G \ge K + 1$ and each
$M_g$ is large enough so that normality is a good approximation to the distribution of
the (properly scaled) sample average within each group.

If $z_{gm}$ is present in the first-stage estimation, we use as the minimum chi-square
weights the inverses of the asymptotic variances for the $G$ intercepts in the separate
group regressions. With large $M_g$, we might make these fully robust to heteroskedasticity in
$E(u_{gm}^2 \mid z_{gm})$ using the White (1980) sandwich variance estimator. At a minimum we
would want to allow different $\sigma_g^2$ even if we assume homoskedasticity within groups.
Once we have the $\widehat{\mathrm{Avar}}(\hat\delta_g)$, which are just the squared reported standard errors for
the $\hat\delta_g$, we use as weights $1/\widehat{\mathrm{Avar}}(\hat\delta_g)$ in the computationally simple WLS procedure.
We are still using independence across $g$ in obtaining a diagonal weighting matrix in
the MD estimation.

An important by-product of the WLS regression is a minimum chi-square statistic
that can be used to test the $G - K - 1$ overidentifying restrictions. The statistic is
easily obtained as the weighted sum of squared residuals, say, $SSR_w$. Under the null
hypothesis in equation (20.73), $SSR_w \overset{a}{\sim} \chi^2_{G-K-1}$ as the group sizes, $M_g$, get large.
If we reject $H_0$ at a reasonably small significance level, the $x_g$ are not sufficient
for characterizing the changing intercepts across groups. If we fail to reject $H_0$, we
can have some confidence in our specification and obtain confidence intervals for
linear combinations of the population averages using the usual standard normal
approximation.
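
The WLS computation and the overidentification statistic described above can be sketched numerically. The following is only an illustration under stated assumptions: the function name `min_chi_square_wls` and the toy numbers are hypothetical, the first-stage estimates are taken to be independent across groups, and the standard errors come directly from $(X'WX)^{-1}$ with $W = \mathrm{diag}(1/\widehat{\mathrm{Avar}}(\hat\delta_g))$, without the residual-variance rescaling that some WLS routines apply by default.

```python
import numpy as np

def min_chi_square_wls(delta_hat, x, avar):
    """Minimum chi-square estimate of theta = (alpha, beta')' by WLS.

    delta_hat : (G,) first-stage estimates (for example, group means)
    x         : (G,) or (G, K) group-level covariates x_g
    avar      : (G,) estimated asymptotic variances of delta_hat,
                for example sigma2_hat_g / M_g when delta_hat_g is a group mean
    """
    delta_hat = np.asarray(delta_hat, dtype=float)
    G = len(delta_hat)
    X = np.column_stack([np.ones(G), np.asarray(x, dtype=float).reshape(G, -1)])
    w = 1.0 / np.asarray(avar, dtype=float)          # weights 1 / Avar(delta_hat_g)
    XtWX = X.T @ (w[:, None] * X)
    theta = np.linalg.solve(XtWX, X.T @ (w * delta_hat))
    se = np.sqrt(np.diag(np.linalg.inv(XtWX)))       # MD standard errors
    resid = delta_hat - X @ theta
    ssr_w = float(np.sum(w * resid ** 2))            # weighted SSR: overid statistic
    df = G - X.shape[1]                              # G - K - 1 restrictions
    return theta, se, ssr_w, df

# Hypothetical toy data: four groups, one group-level covariate.
M = np.array([200, 150, 300, 250])        # group sizes M_g
ybar = np.array([1.0, 1.5, 2.0, 2.5])     # group means (exactly linear in x here)
s2 = np.array([4.0, 3.0, 5.0, 2.0])       # within-group sample variances
x = np.array([0.0, 1.0, 2.0, 3.0])
theta, se, ssr_w, df = min_chi_square_wls(ybar, x, s2 / M)
```

In practice `ssr_w` would be compared with a chi-square critical value with `df` degrees of freedom; because the toy group means lie exactly on a line, the statistic here is essentially zero.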

We might also be interested in how one (or more) of the slopes in $\gamma_g$ depends on
the group features, $x_g$. Then, we simply replace $\hat\delta_g$ with, say, $\hat\gamma_{g1}$, the slope on the
first element of $z_{gm}$. Naturally, we would use $1/\widehat{\mathrm{Avar}}(\hat\gamma_{g1})$ as the weights in the MD
estimation.

The minimum distance approach can also be applied if we impose $\gamma_g = \gamma$ for all
$g$, as in the original model. Obtaining the $\hat\delta_g$ themselves is easy: run the pooled
regression

$$y_{gm} \ \text{on} \ d1_g, d2_g, \ldots, dG_g, z_{gm}, \qquad m = 1,\ldots,M_g; \ g = 1,\ldots,G, \tag{20.75}$$

where $d1_g, d2_g, \ldots, dG_g$ are group dummy variables. Using the $\hat\delta_g$ from the pooled
regression (20.75) in MD estimation is complicated by the fact that the $\hat\delta_g$ are no
longer asymptotically independent; in fact, $\hat\delta_g = \bar y_g - \bar z_g\hat\gamma$, where $\hat\gamma$ is the vector of
common slopes, and the presence of $\hat\gamma$ induces correlation among the intercept
estimators. Let $\hat V$ be the $G \times G$ estimated (asymptotic) variance matrix of the $G \times 1$
vector $\hat\delta$. Then the MD estimator is $\hat\theta = (X'\hat V^{-1}X)^{-1}X'\hat V^{-1}\hat\delta$, and its estimated
asymptotic variance is $(X'\hat V^{-1}X)^{-1}$. If the OLS regression (20.74) is used, or even the
WLS version, the resulting standard errors will be incorrect because they ignore the
across-group correlation in the estimated intercepts.
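
When the first-stage intercepts are correlated, the MD estimator with a full weighting matrix can be computed directly from its closed form, $(X'\hat V^{-1}X)^{-1}X'\hat V^{-1}\hat\delta$. The sketch below is illustrative only: `min_distance` is a hypothetical helper name, the matrix `V` is a made-up symmetric positive definite example rather than output from an actual pooled regression, and linear solves are used in place of explicit inversion of `V`.

```python
import numpy as np

def min_distance(delta_hat, X, V):
    """MD estimator with a general (possibly nondiagonal) variance matrix V.

    delta_hat : (G,) first-stage estimates
    X         : (G, K+1) matrix with gth row (1, x_g)
    V         : (G, G) symmetric estimated variance matrix of delta_hat
    Returns theta_hat and its estimated asymptotic variance (X' V^{-1} X)^{-1}.
    """
    delta_hat = np.asarray(delta_hat, dtype=float)
    X = np.asarray(X, dtype=float)
    V = np.asarray(V, dtype=float)
    Vinv_X = np.linalg.solve(V, X)                 # V^{-1} X
    A = X.T @ Vinv_X                               # X' V^{-1} X
    theta = np.linalg.solve(A, Vinv_X.T @ delta_hat)   # uses symmetry of V
    return theta, np.linalg.inv(A)

# Hypothetical example: equicorrelated intercept estimates across four groups.
x = np.array([0.0, 1.0, 2.0, 3.0])
X = np.column_stack([np.ones(4), x])
V = 0.01 * (np.eye(4) + 0.5 * np.ones((4, 4)))
delta_hat = X @ np.array([2.0, -1.0])              # restrictions hold exactly
theta, avar = min_distance(delta_hat, X, V)
```

With a diagonal `V` this collapses to the WLS calculation with weights equal to the inverse variances.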

Intermediate approaches are available, too. Loeb and Bound (1996) (hereafter, 
LB) allow different group intercepts and group-specific slopes on education, but 
impose common slopes on demographic and family background variables. The main 
group-level covariate is the student-teacher ratio. Thus LB are interested in seeing 
how the student-teacher ratio affects the relationship between test scores and educa- 
tion levels. They use both the unweighted estimator and the weighted estimator and 
find that the results differ in unimportant ways. Because they impose common slopes 
on a set of regressors, the estimated slopes on education (say $\hat\gamma_{g1}$) are not
asymptotically independent, and perhaps using a nondiagonal estimated variance matrix $\hat V$
(which would be $36 \times 36$ in this case) is more appropriate.

If we reject the overidentifying restrictions, we are essentially concluding that
$\delta_g = \alpha + x_g\beta + c_g$, where $c_g$ can be interpreted as the deviation from the restrictions
in equation (20.73) for group $g$. As $G$ increases relative to $K$, the likelihood of
rejecting the restrictions increases. One possibility is to apply the Donald and Lang
approach, where the OLS regression (20.74) is analyzed in the context of the classical
linear model (CLM) with inference based on the $t_{G-K-1}$ distribution. Why is a CLM
analysis justified? Since $\hat\delta_g = \delta_g + O_p(M_g^{-1/2})$, we can ignore the estimation error in $\hat\delta_g$
for large $M_g$. Then, it is as if we are estimating the equation $\delta_g = \alpha + x_g\beta + c_g$,
$g = 1,\ldots,G$, by OLS. If the $c_g$ are drawn from a normal distribution, classical
analysis is applicable because $c_g$ is assumed to be independent of $x_g$. This approach is
desirable when one cannot, or does not want to, find group-level observables that
completely determine the $\delta_g$. It is predicated on the assumption that the other factors
in $c_g$ are not systematically related to $x_g$, a reasonable assumption if, say, $x_g$ is a
randomly assigned treatment at the group level, a case considered by Angrist and
Lavy (2002).
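
The Donald and Lang style calculation amounts to an ordinary regression of the estimated intercepts on the group features, with classical standard errors and $G - K - 1$ degrees of freedom. A minimal sketch follows; the function name `donald_lang_ols` and the five data points are hypothetical, and the returned $t$ statistics are meant to be compared with critical values from a $t$ distribution with `df` degrees of freedom, which is left to the reader.

```python
import numpy as np

def donald_lang_ols(delta_hat, x):
    """OLS of delta_hat_g on (1, x_g) with classical (CLM) standard errors.

    Treats the G first-stage estimates as the data, ignoring their
    estimation error, which is the large-M_g approximation in the text.
    """
    delta_hat = np.asarray(delta_hat, dtype=float)
    G = len(delta_hat)
    X = np.column_stack([np.ones(G), np.asarray(x, dtype=float).reshape(G, -1)])
    df = G - X.shape[1]                        # G - K - 1
    if df <= 0:
        raise ValueError("classical inference needs G > K + 1")
    XtX_inv = np.linalg.inv(X.T @ X)
    theta = XtX_inv @ X.T @ delta_hat
    resid = delta_hat - X @ theta
    s2 = resid @ resid / df                    # unbiased error variance estimate
    se = np.sqrt(s2 * np.diag(XtX_inv))
    return theta, se, theta / se, df

# Hypothetical estimated intercepts for G = 5 groups and one group covariate.
delta_hat = np.array([1.0, 2.1, 2.9, 4.2, 4.8])
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
theta, se, t_stats, df = donald_lang_ols(delta_hat, x)
```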

Unlike in the linear case, for nonlinear models exact inference is unavailable even 
under the strongest set of assumptions. Nevertheless, if the group sizes $M_g$ are
reasonably large, we can extend the DL approach to nonlinear models and obtain
approximate inference. In addition, the minimum distance approach carries over 
essentially without change. 

We can apply the methods to any nonlinear model that has an index structure, 
which includes all the common ones, and many other models besides. Again, it is 
helpful to study the probit case in some detail. With small G and random sampling of 
$\{(y_{gm}, z_{gm}): m = 1,\ldots,M_g\}$ within each $g$, write

$$P(y_{gm} = 1 \mid z_{gm}) = \Phi(\delta_g + z_{gm}\gamma_g), \qquad m = 1,\ldots,M_g \tag{20.76}$$

$$\delta_g = \alpha + x_g\beta, \qquad g = 1,\ldots,G. \tag{20.77}$$

As with the linear model, we assume the intercept, $\delta_g$ in equation (20.76), is a
function of the group features $x_g$. With the $M_g$ moderately large, we can get good
estimates of the $\delta_g$. The $\hat\delta_g$, $g = 1,\ldots,G$, are easily obtained by estimating a separate
probit for each group. Or, we can impose common $\gamma_g$ and just estimate different
group intercepts (sometimes called “group fixed effects”).

Under the restrictions in equation (20.77), we can apply the minimum distance
approach just as before. Let $\widehat{\mathrm{Avar}}(\hat\delta_g)$ denote the estimated asymptotic variances of
the $\hat\delta_g$ (so these shrink to zero at the rate $1/M_g$). If the $\hat\delta_g$ are obtained from $G$
separate probits, they are independent, and the $\widehat{\mathrm{Avar}}(\hat\delta_g)$ are all we need. As in the linear
case, if a pooled method is used, the $G \times G$ matrix $\widehat{\mathrm{Avar}}(\hat\delta)$ should be inverted and
then used as the weighting matrix. For binary response, we use the usual MLE
estimated variance. If we are using fractional probit for a fractional response, these
would be from a sandwich estimate of the asymptotic variance. In the case where the
$\hat\delta_g$ are obtained from separate probits, we can obtain the minimum distance estimates
as the WLS estimates from

$$\hat\delta_g \ \text{on} \ 1, x_g, \qquad g = 1,\ldots,G$$

using weights $1/\widehat{\mathrm{Avar}}(\hat\delta_g)$. This is the efficient minimum distance estimator and,
conveniently, the proper asymptotic standard errors are reported from the WLS
estimation (even though we are doing large $M_g$, not large $G$, asymptotics here).
Generally, we can write the MD estimator as before: $\hat\theta = (X'\hat V^{-1}X)^{-1}X'\hat V^{-1}\hat\delta$, where
$\hat\delta$ is the $G \times 1$ vector of $\hat\delta_g$ and $\hat V = \widehat{\mathrm{Avar}}(\hat\delta)$. The overidentification test is obtained
exactly as in the linear case: there are $G - K - 1$ degrees of freedom in the chi-square
distribution.

If we reject the overidentification restrictions, we can adapt Donald and Lang
(2007) and treat

$$\hat\delta_g = \alpha + x_g\beta + \mathit{error}_g, \qquad g = 1,\ldots,G \tag{20.78}$$

as approximately satisfying the classical linear model assumptions, provided $G >
K + 1$, just as before. As in the linear case, this approach is justified if $\delta_g = \alpha + x_g\beta +
c_g$ with $c_g$ independent of $x_g$ and $c_g$ drawn from a homoskedastic normal distribution.
It assumes that we can ignore the estimation error in $\hat\delta_g$, based on $\hat\delta_g = \delta_g +
O_p(1/\sqrt{M_g})$. Because the DL approach ignores the estimation error in $\hat\delta_g$, it is
unchanged if one imposes some constant slopes across the groups.

Once we have estimated $\alpha$ and $\beta$, the estimated effect on the response probability
can be obtained by averaging the response probability for a given $x$:

$$G^{-1}\sum_{g=1}^{G}\left(M_g^{-1}\sum_{m=1}^{M_g}\Phi(\hat\alpha + x\hat\beta + z_{gm}\hat\gamma_g)\right), \tag{20.79}$$

where derivatives or differences with respect to the elements of $x$ can be computed.
Here, the minimum distance approach has an important advantage over the DL
approach: the finite sample properties of estimator (20.79) are virtually impossible to
obtain, whereas the large-$M_g$ asymptotics underlying minimum distance would be
straightforward using the delta method. The bootstrap should also be valid when the
sampling scheme generates independent observations within each $g$.
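
The averaging of probit response probabilities can be illustrated with a few lines of code. This sketch assumes the average is taken first within each group and then across groups with equal weight on each group; the function names (`norm_cdf`, `avg_response_prob`) and all inputs are hypothetical, and the standard normal cdf is built from the error function in the Python standard library.

```python
import numpy as np
from math import erf, sqrt

def norm_cdf(v):
    """Standard normal cdf, applied elementwise."""
    v = np.asarray(v, dtype=float)
    return 0.5 * (1.0 + np.vectorize(erf)(v / sqrt(2.0)))

def avg_response_prob(alpha, beta, x, z_by_group, gamma_by_group):
    """Average of Phi(alpha + x beta + z_gm gamma_g) within, then across, groups.

    z_by_group     : list of (M_g, L) arrays of individual covariates
    gamma_by_group : list of (L,) slope vectors, one per group
    """
    index_x = alpha + float(np.dot(x, beta))       # group-feature part of the index
    group_means = [
        norm_cdf(index_x + np.asarray(z, dtype=float) @ np.asarray(g, dtype=float)).mean()
        for z, g in zip(z_by_group, gamma_by_group)
    ]
    return float(np.mean(group_means))

# Hypothetical inputs: two groups, one individual covariate with zero slopes.
z_groups = [np.zeros((3, 1)), np.ones((2, 1))]
gammas = [np.array([0.0]), np.array([0.0])]
p0 = avg_response_prob(0.0, [1.0], [0.0], z_groups, gammas)
p1 = avg_response_prob(0.0, [1.0], [1.0], z_groups, gammas)
effect = p1 - p0        # discrete change in the response probability as x goes 0 -> 1
```

Standard errors for `effect` are not computed here; the delta method or a bootstrap over individuals within groups would be needed for that.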

With binary response problems, the two-step methods described here are
problematical when the response does not vary within group. For example, suppose that
$x_g$ is a binary treatment, equal to one for receiving a voucher to attend college, and
$y_{gm}$ is an indicator of attending college. Each group is a high school class, say. If
some high schools have all students attend college, one cannot use probit (or logit) of
$y_{gm}$ on $z_{gm}$, $m = 1,\ldots,M_g$. A linear regression returns zero-slope coefficients and
intercept equal to unity. Of course, if randomization occurs at the group level (that
is, $x_g$ is independent of group attributes) then it is not necessary to control for the
$z_{gm}$. Instead, the within-group averages can be used in a simple minimum distance
approach. In this case, as $y_{gm}$ is binary, the DL approximation will not be valid, as
the CLM assumptions will not even approximately hold in the model $\bar y_g = \alpha + x_g\beta +
e_g$ (because $\bar y_g$ is always a fraction regardless of the size of $M_g$).

Naturally, there is nothing special about binary response models. It is possible to
apply any nonlinear model using the individual-specific data to obtain group-level
estimates. Then, equation (20.78) can be applied.


20.4 Complex Survey Sampling 


Often, survey data are characterized by clustering and variable probability sampling. 
For example, suppose that g represents the primary sampling unit (PSU) (say, city) 
and individuals or families (indexed by m) are secondary sampling units, sampled 
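within each PSU with probability $p_{gm}$. Consider the problem of regression using such
a data set. If $\hat\beta$ is the IPW estimator pooled across PSUs and individuals, then its
variance is estimated as

$$\left(\sum_{g=1}^{G}\sum_{m=1}^{M_g} x_{gm}'x_{gm}/p_{gm}\right)^{-1}\left[\sum_{g=1}^{G}\sum_{m=1}^{M_g}\sum_{r=1}^{M_g}\hat u_{gm}\hat u_{gr}x_{gm}'x_{gr}/(p_{gm}p_{gr})\right]\left(\sum_{g=1}^{G}\sum_{m=1}^{M_g} x_{gm}'x_{gm}/p_{gm}\right)^{-1}. \tag{20.80}$$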


The middle of the sandwich accounts for cluster correlation along with unequal 
sampling probabilities. If the probabilities are estimated using retention frequencies, 
expression (20.80) is conservative, as we discussed in Section 20.2.2. A similar ex- 
pression holds for general M-estimation. Typically, packages that support survey 
sampling require a variable defining the clusters along with a variable containing the 
inverse probability weights. 
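
To make the structure of the sandwich concrete, the following sketch computes a pooled IPW least squares estimate and a cluster-robust variance of the form in (20.80). It is an illustration under stated assumptions: the function names `ipw_ols` and `ipw_cluster_variance` are hypothetical, the sampling probabilities are treated as known, and no finite-sample degrees-of-freedom correction is applied.

```python
import numpy as np

def ipw_ols(X, y, p):
    """Pooled least squares with inverse probability weights 1/p."""
    X, y, p = (np.asarray(a, dtype=float) for a in (X, y, p))
    w = 1.0 / p
    return np.linalg.solve(X.T @ (w[:, None] * X), X.T @ (w * y))

def ipw_cluster_variance(X, u_hat, p, cluster):
    """Sandwich variance allowing correlation within clusters.

    X       : (n, k) regressors, rows x_gm
    u_hat   : (n,) residuals from the IPW regression
    p       : (n,) sampling probabilities p_gm
    cluster : (n,) cluster (PSU) identifiers g
    """
    X, u_hat, p = (np.asarray(a, dtype=float) for a in (X, u_hat, p))
    cluster = np.asarray(cluster)
    A = X.T @ (X / p[:, None])                       # sum of x'x / p
    B = np.zeros((X.shape[1], X.shape[1]))
    for g in np.unique(cluster):
        idx = cluster == g
        s_g = X[idx].T @ (u_hat[idx] / p[idx])       # sum_m x_gm' u_gm / p_gm
        B += np.outer(s_g, s_g)                      # double sum over m, r within g
    A_inv = np.linalg.inv(A)
    return A_inv @ B @ A_inv
```

When every observation is its own cluster and all probabilities equal one, the result reduces to the usual heteroskedasticity-robust (White) variance matrix, which gives a quick sanity check.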

Multistage sampling schemes introduce more complications because standard
stratification is often involved. Consider the following setup, closely related to
Bhattacharya (2005). Let there be $S$ strata (for example, states in the United States),
exhaustive and mutually exclusive. Within stratum $s$, there are $C_s$ clusters (for example,
zip codes). In order to use large-sample approximations, we assume that in each
stratum a large number of clusters is sampled. Typically, the sampling of clusters is
without replacement, but the resulting dependence across sampled clusters
is more difficult to study. Instead, we assume sampling with replacement, which is
harmless if the number of clusters sampled within each stratum, $N_s$, is “large.” As
before, we allow arbitrary correlation across units (say, households) within each
cluster (say, zip code).

Within stratum $s$ and cluster $c$, let there be $M_{sc}$ total units (households or
individuals). Therefore, the total number of units in the population is

$$M = \sum_{s=1}^{S}\sum_{c=1}^{C_s} M_{sc}. \tag{20.81}$$


It is convenient to start with the problem of estimating the mean of a variable that
describes the population. Let $z$ be a variable, such as family income, whose mean we
want to estimate. List all population values as $\{z^o_{scm}: m = 1,\ldots,M_{sc}, c = 1,\ldots,C_s,
s = 1,\ldots,S\}$, so the population mean can be written as

$$\mu = M^{-1}\sum_{s=1}^{S}\sum_{c=1}^{C_s}\sum_{m=1}^{M_{sc}} z^o_{scm}, \tag{20.82}$$

and the total in the population is

$$\tau = \sum_{s=1}^{S}\sum_{c=1}^{C_s}\sum_{m=1}^{M_{sc}} z^o_{scm} = M\mu. \tag{20.83}$$

It is also useful to define the totals within each cluster and stratum, $\tau_{sc} = \sum_{m=1}^{M_{sc}} z^o_{scm}$
and $\tau_s = \sum_{c=1}^{C_s}\tau_{sc}$, respectively.

The specific sampling scheme is as follows: (1) for each stratum $s$, randomly draw
$N_s$ clusters, with replacement; (2) for each cluster $c$ drawn in step (1), randomly
sample $K_{sc}$ households with replacement. For each pair $(s,c)$, define the sample
average

$$\hat\mu_{sc} = K_{sc}^{-1}\sum_{m=1}^{K_{sc}} z_{scm}. \tag{20.84}$$

Because this is an average based on a random sample within $(s,c)$,

$$E(\hat\mu_{sc}) = \mu_{sc} = M_{sc}^{-1}\sum_{m=1}^{M_{sc}} z^o_{scm}. \tag{20.85}$$

To continue up to the cluster level we need the total, $\tau_{sc} = M_{sc}\mu_{sc}$, for which an
unbiased estimator is $\hat\tau_{sc} = M_{sc}\hat\mu_{sc}$ for all $\{(s,c): c = 1,\ldots,C_s, s = 1,\ldots,S\}$ (even if
we eventually do not use some clusters because they are not sampled). Now, for each
stratum $s$, the estimator $N_s^{-1}\sum_{c=1}^{N_s}\hat\tau_{sc}$, which is the average of the cluster totals
within stratum $s$, has an expected value equal to the population average (for stratum
$s$), that is, $C_s^{-1}\sum_{c=1}^{C_s}\tau_{sc} = C_s^{-1}\sum_{c=1}^{C_s}\sum_{m=1}^{M_{sc}} z^o_{scm} = C_s^{-1}\tau_s$. (In general, $C_s^{-1}\tau_s \ne \mu_s =
(\sum_{c=1}^{C_s} M_{sc})^{-1}\tau_s$ unless each cluster has only one observation.) It follows that an
unbiased estimator of the total $\tau_s$ for stratum $s$ is

$$C_s \cdot N_s^{-1}\sum_{c=1}^{N_s}\hat\tau_{sc}. \tag{20.86}$$


Finally, the total in the entire population is estimated as

$$\sum_{s=1}^{S}\left(C_s \cdot N_s^{-1}\sum_{c=1}^{N_s}\hat\tau_{sc}\right) = \sum_{s=1}^{S}\sum_{c=1}^{N_s}(C_s/N_s)(M_{sc}/K_{sc})\sum_{m=1}^{K_{sc}} z_{scm} \equiv \sum_{s=1}^{S}\sum_{c=1}^{N_s}\sum_{m=1}^{K_{sc}}\omega_{sc}z_{scm}, \tag{20.87}$$

where

$$\omega_{sc} \equiv \frac{C_s}{N_s}\cdot\frac{M_{sc}}{K_{sc}} \tag{20.88}$$

is the weight for every unit sampled in stratum-cluster pair $(s,c)$. This weight
accounts for undersampled or oversampled clusters within strata and undersampled
or oversampled units within clusters. Expressions (20.87) and (20.88) appear in the
literature on complex survey sampling, sometimes without $M_{sc}/K_{sc}$ when each cluster
is sampled as a complete unit, and so $M_{sc}/K_{sc} = 1$. To estimate the population mean,
$\mu$, we just divide by $M$, the total number of units in the population,


$$\hat\mu = M^{-1}\left(\sum_{s=1}^{S}\sum_{c=1}^{N_s}\sum_{m=1}^{K_{sc}}\omega_{sc}z_{scm}\right). \tag{20.89}$$
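
A small numerical sketch may help fix the role of the weights $(C_s/N_s)(M_{sc}/K_{sc})$ in estimating a population total. The function name `survey_total` and the toy design (one stratum with four clusters in the population, two of which are sampled) are hypothetical; every argument carries one entry per sampled unit.

```python
import numpy as np

def survey_total(z, C_s, N_s, M_sc, K_sc):
    """Weighted estimate of a population total under two-stage sampling.

    z    : sampled values z_scm
    C_s  : number of clusters in the unit's stratum (population)
    N_s  : number of clusters sampled from that stratum
    M_sc : number of units in the unit's cluster (population)
    K_sc : number of units sampled from that cluster
    """
    z, C_s, N_s, M_sc, K_sc = (np.asarray(a, dtype=float)
                               for a in (z, C_s, N_s, M_sc, K_sc))
    omega = (C_s / N_s) * (M_sc / K_sc)      # weight for each sampled unit
    return float(np.sum(omega * z)), omega

# Hypothetical sample: cluster A (10 units, 2 sampled), cluster B (6 units, 3 sampled).
z = [1.0, 3.0, 2.0, 2.0, 2.0]
C_s = [4] * 5
N_s = [2] * 5
M_sc = [10, 10, 6, 6, 6]
K_sc = [2, 2, 3, 3, 3]
total_hat, omega = survey_total(z, C_s, N_s, M_sc, K_sc)
# Dividing total_hat by the known population size M gives the mean estimate.
```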


In fact, we do not need to know the population size, $M$, to obtain an unbiased
estimator of $\mu$. We can obtain an alternative estimator that uses a modified set of
weights. It falls out naturally from a regression framework, to which we now turn.

To study the asymptotic properties of regression (and many other estimation
methods), it is convenient to modify the weights so that they are constant, or
converge to a constant. The weights $\omega_{sc}$ in expression (20.88) converge to zero at rate
$N_s^{-1}$ because $C_s$ and $M_{sc}$ are fixed and $K_{sc}$ is treated as fixed. (We assume a relatively
small number of households sampled per cluster.) Let $N = N_1 + N_2 + \cdots + N_S$ be
the total number of clusters sampled and define

$$v_{sc} \equiv \frac{C_s}{(N_s/N)}\cdot\frac{M_{sc}}{K_{sc}} = N\omega_{sc}. \tag{20.90}$$

As in Bhattacharya (2005), it is easiest just to assume $N_s = a_sN$ for $a_s$ fixed,
$0 < a_s < 1$, $a_1 + \cdots + a_S = 1$. But we can also just assume $N_s/N$ converges to $a_s$ with
the same property. Therefore, by writing $v_{sc} = (C_s/a_s)(M_{sc}/K_{sc})$, we see that $v_{sc}$ is
constant. Further, any optimization problem that uses $\omega_{sc}$ as weights gives the same
answer when $v_{sc}$ is used because the scale factor in equation (20.90) does not depend
on $s$ or $c$. The key in the formulas for the asymptotic variance below is that $v_{sc}$ is
(roughly) constant.

While equation (20.90) is the most natural definition of the weights for obtaining
the limiting distribution results, we can use different formulations without changing
the end formulas. For example, let $C = C_1 + \cdots + C_S$ be the total number of clusters
in the population, let $M$ be the total number of units in the population, and let $K$ be
the total number of units sampled. Then, for the final formulas, we could use the weights defined
as

$$v_{sc} = \frac{(C_s/C)}{(N_s/N)}\cdot\frac{(M_{sc}/M)}{(K_{sc}/K)} = \omega_{sc}\frac{(NK)}{(CM)}. \tag{20.91}$$

Because $C$, $M$, and $K$ are fixed, the factor $K/(CM)$ has no effect on estimation or
inference. Equation (20.91) has a nice interpretation because it is expressed in terms
of frequencies of the population relative to the sample frequencies. For example, if
$(C_s/C) > (N_s/N)$, which means that stratum $s$ is underrepresented in terms of number
of clusters, equation (20.91) gives more weight to such strata. The same is true of
the fractions involving the number of units (say, households).

While we can consider general M-estimation problems, or generalized method of
moments as in Bhattacharya (2005), we consider least squares for concreteness. The
weighted minimization problem is

$$\min_{\beta}\ N^{-1}\sum_{s=1}^{S}\sum_{c=1}^{N_s}\sum_{m=1}^{K_{sc}} v_{sc}(y_{scm} - x_{scm}\beta)^2, \tag{20.92}$$

where it is helpful to divide by $N$ to facilitate the asymptotic analysis as $N \to \infty$. The
first-order condition is

$$N^{-1}\sum_{s=1}^{S}\sum_{c=1}^{N_s}\sum_{m=1}^{K_{sc}} v_{sc}x_{scm}'(y_{scm} - x_{scm}\hat\beta) = 0. \tag{20.93}$$


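Using arguments similar to the SS sampling case, but accounting for the
clustering (by, in effect, treating each cluster as its own observation), we can show
that an appropriate estimator of $\mathrm{Avar}(\hat\beta)$ (in the sense that it is consistent for
$\mathrm{Avar}[\sqrt{N}(\hat\beta - \beta)]$ when multiplied by $N$) is

$$\left(\sum_{s=1}^{S}\sum_{c=1}^{N_s}\sum_{m=1}^{K_{sc}} v_{sc}x_{scm}'x_{scm}\right)^{-1}\hat B\left(\sum_{s=1}^{S}\sum_{c=1}^{N_s}\sum_{m=1}^{K_{sc}} v_{sc}x_{scm}'x_{scm}\right)^{-1}, \tag{20.94}$$

where $\hat B$ is somewhat complicated:

$$\hat B = \sum_{s=1}^{S}\sum_{c=1}^{N_s}\sum_{m=1}^{K_{sc}} v_{sc}^2\hat u_{scm}^2x_{scm}'x_{scm} + \sum_{s=1}^{S}\sum_{c=1}^{N_s}\sum_{m=1}^{K_{sc}}\sum_{r\ne m} v_{sc}^2\hat u_{scm}\hat u_{scr}x_{scm}'x_{scr}$$

$$\qquad - \sum_{s=1}^{S} N_s^{-1}\left(\sum_{c=1}^{N_s}\sum_{m=1}^{K_{sc}} v_{sc}x_{scm}'\hat u_{scm}\right)\left(\sum_{c=1}^{N_s}\sum_{m=1}^{K_{sc}} v_{sc}x_{scm}'\hat u_{scm}\right)'. \tag{20.95}$$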

The first part of $\hat B$ is obtained using the White “heteroskedasticity”-robust form. The
second piece accounts for the correlation within clusters; this is typically a positive
definite matrix, and it generally increases the asymptotic standard errors. The third
piece actually reduces the variance by accounting for the nonzero means of the
“score” within strata, just as in the SS sampling case.
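
The three pieces of the middle matrix can be assembled in a few lines once the weighted scores are summed within clusters: because the weight is constant within a cluster, the first two pieces together equal the sum over clusters of the outer product of each cluster's summed score. The sketch below is an illustration only; the function name `strat_cluster_variance` is hypothetical, the weights are taken as given, and cluster identifiers are only assumed to be unique within a stratum.

```python
import numpy as np

def strat_cluster_variance(X, u_hat, v, stratum, cluster):
    """Variance estimate for weighted least squares under stratified cluster sampling.

    X       : (n, k) regressors, rows x_scm
    u_hat   : (n,) residuals
    v       : (n,) weights v_sc (constant within each cluster)
    stratum : (n,) stratum identifiers s
    cluster : (n,) cluster identifiers c
    """
    X = np.asarray(X, dtype=float)
    u_hat = np.asarray(u_hat, dtype=float)
    v = np.asarray(v, dtype=float)
    stratum, cluster = np.asarray(stratum), np.asarray(cluster)
    k = X.shape[1]
    score = X * (v * u_hat)[:, None]              # rows v_sc * u_scm * x_scm
    A = X.T @ (X * v[:, None])                    # weighted sum of x'x
    B = np.zeros((k, k))
    for s in np.unique(stratum):
        in_s = stratum == s
        clusters = np.unique(cluster[in_s])
        stratum_sum = np.zeros(k)
        for c in clusters:
            g = score[in_s & (cluster == c)].sum(axis=0)   # cluster-level score
            B += np.outer(g, g)                   # White piece plus within-cluster piece
            stratum_sum += g
        # Third piece: subtract N_s^{-1} times the outer product of the stratum score.
        B -= np.outer(stratum_sum, stratum_sum) / len(clusters)
    A_inv = np.linalg.inv(A)
    return A_inv @ B @ A_inv
```

With a single stratum, one unit per cluster, unit weights, and OLS residuals from a regression that includes an intercept, the stratum score sums to zero and the result coincides with the usual heteroskedasticity-robust matrix.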

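If each cluster has just one unit, so $M_{sc} = K_{sc} = 1$, then expression (20.94) reduces
to

$$\left(\sum_{s=1}^{S}\sum_{c=1}^{N_s} v_{sc}x_{sc}'x_{sc}\right)^{-1}\left[\sum_{s=1}^{S}\sum_{c=1}^{N_s} v_{sc}^2\hat u_{sc}^2x_{sc}'x_{sc} - \sum_{s=1}^{S} N_s^{-1}\left(\sum_{c=1}^{N_s} v_{sc}x_{sc}'\hat u_{sc}\right)\left(\sum_{c=1}^{N_s} v_{sc}x_{sc}'\hat u_{sc}\right)'\right]\left(\sum_{s=1}^{S}\sum_{c=1}^{N_s} v_{sc}x_{sc}'x_{sc}\right)^{-1}, \tag{20.96}$$

which is the formula for standard stratified sampling with a finite number of units in
each stratum.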

For general M-estimation, the outer sandwich in (20.94) is replaced with the
inverse of the weighted Hessian, $[\sum_{s=1}^{S}\sum_{c=1}^{N_s}\sum_{m=1}^{K_{sc}} v_{sc}H(w_{scm}, \hat\theta)]^{-1}$, while $x_{scm}'\hat u_{scm}$ in
equation (20.95) is replaced with the score, $s(w_{scm}, \hat\theta)$. Some econometrics packages
have made implementation fairly straightforward for a variety of linear and
nonlinear models. To obtain the correct asymptotic variance estimator, one that is
neither too optimistic nor too conservative, one needs to specify the strata, the clusters,
and the sampling weights.


Problems 


20.1. Use expressions (20.4) and (20.5) to answer this question. 

a. Derive the estimator in equation (20.5) from the minimization problem in expres- 
sion (20.4). 

b. Show directly that the estimator $\tilde\mu_w = N^{-1}\sum_{i=1}^{N}(s_i/p_{j_i})w_i$ is unbiased for $\mu$.


c. What practical advantage does $\hat\mu_w$ have over $\tilde\mu_w$?


20.2. Use the log likelihood in equation (20.9) to derive $\hat p_j = M_j/N_j$, $j = 1,\ldots,J$,
where $M_j$ is the number of retained observations from stratum $j$ and $N_j$ is the number
of times stratum $j$ was drawn.


20.3. Let $y$ be a scalar response variable and $x$ a vector of explanatory variables,
and let $m(x, \theta)$ denote a model for $E(y \mid x)$. The parameter space is $\Theta$.

a. Let $\hat\theta_w$ be the IPW nonlinear least squares estimator. Write down the minimization
problem solved by $\hat\theta_w$.

b. Assume that the model is correctly specified for $E(y \mid x)$, and let $\theta_o$ denote the
population value; assume that $\theta_o$ is identified in the population. Provide a set of
sufficient conditions for consistency of $\hat\theta_w$ for $\theta_o$. (Hint: See Theorem 12.2.)

c. Assuming that $m(x, \cdot)$ is twice continuously differentiable on the interior of $\Theta$ and
that $\theta_o \in \mathrm{int}(\Theta)$, propose an estimator of the asymptotic variance of $\hat\theta_w$ that depends
only on the gradient of $m(x, \cdot)$, not its Hessian.

d. If you add the homoskedasticity assumption $\mathrm{Var}(y \mid x) = \sigma_o^2$, does the formula
from part c simplify?

e. If $m(x, \theta)$ is misspecified, how should you adjust the estimator in part c?


20.4. Consider the problem of standard stratified sampling. Assume that the sample
shares, $H_j$, converge to $\bar H_j > 0$ as $N \to \infty$, $j = 1,\ldots,J$. Further, suppose that $\theta_o$
minimizes $E[q(w, \theta) \mid x]$ over $\Theta$ for each $x$ and that $\theta_o$ uniquely minimizes $E[q(w, \theta)]$
over $\Theta$. Argue that the unweighted estimator is consistent for $\theta_o$. (Hint: Write the
unweighted objective function as

$$\sum_{j=1}^{J} H_j\left[N_j^{-1}\sum_{i=1}^{N_j} q(w_{ij}, \theta)\right]$$

and argue that this function converges uniformly to

$$\bar H_1E[q(w, \theta) \mid x \in \mathcal{X}_1] + \bar H_2E[q(w, \theta) \mid x \in \mathcal{X}_2] + \cdots + \bar H_JE[q(w, \theta) \mid x \in \mathcal{X}_J],$$

where the strata are $\mathcal{X}_1, \mathcal{X}_2, \ldots, \mathcal{X}_J$. Then show that $\theta_o$ uniquely minimizes this
expression by arguing that it uniquely minimizes $E[q(w, \theta) \mid x \in \mathcal{X}_j]$ for at least one $j$.)


20.5. Use the data in BENEFITS.RAW to answer this question. 


a. To equation (20.33) add the within-district averages of bs, lstaff, lenroll, and
lunch, where lstaff and lenroll denote logarithms. Estimate this equation by pooled
OLS. How do the coefficients on bs, lstaff, lenroll, and lunch compare with the FE
coefficients? Are the usual pooled OLS standard errors valid here?


b. Estimate the equation from part a by random effects. (That is, include the district 
averages along with the original variables.) How do these estimates compare with the 
FE estimates? How do the cluster-robust standard errors compare with the cluster- 
robust standard errors for FE? 

c. Use the estimation in part b to obtain the value of the fully robust Wald statistic 
testing the RE assumption that the district effect is uncorrelated with the four district 
averages. 


20.6. Use the data in BENEFITS.RAW to answer this question. 


a. How many schools in the sample have a benefits-salary ratio of at least 0.5? 


b. Estimate equation (20.33) by fixed effects omitting the observations from part a. 
Discuss how the estimate of $\beta_{bs}$ changes, as well as its cluster-robust standard error.


c. Now add the within-district averages of all four variables and estimate the equa- 
tion by least absolute deviations, using all the observations. How strong is the evi- 
dence for a trade-off using LAD? 


20.7. Use the data in MEAP94_98 to answer this question. 


a. How many schools have all five years of data? Are there any schools with only one 
year? 

b. Obtain the within-school time averages of the variables lavgrexp, lunch, lenrol,
and the four year dummies y95 through y98. Include these in a pooled OLS regression
that includes the other variables in Table 20.2 (including the year dummies
themselves). Verify that the coefficients on the original variables are the FE estimates.
What is the coefficient on $\overline{lunch}$ (the time average)? Is it statistically different from
zero using a cluster-robust standard error at the district level?

c. Now use RE rather than pooled OLS on the equation in part b. Again verify that 
you obtain the FE estimates on the original variables. Is the RE coefficient on lunch 
identical to the POLS coefficient? Is it still statistically significant? 


d. Redo part c, but do not include the time averages of the year dummies. Do you 
still get the FE estimates on lavgrexp, lunch, and lenrol? Why, with an unbalanced
panel, must we include the time averages of year dummies for RE to equal FE, 
whereas we did not have to in the balanced case? 

e. Now go back to the original FE estimation in Table 20.2, but drop the year 
dummies. How does the estimated spending effect change from Table 20.2? Which 
estimate is more reliable? 

f. Return to the equation implicit in Table 20.2, but estimate the equation by pooled 
OLS and RE. (That is, do not include the time averages of the variables.) How do the 
estimates of the spending effect compare with the FE estimates? How come the lunch 
variable is much more important in the POLS and RE estimation? 

g. Considering the various estimates and standard errors in Table 20.2 and obtained 
for this problem, which estimate of the spending variable and which standard error 
seem most reliable? 


20.8. In the setting of Section 20.3.1, let $y_{gm}$ be a fractional response variable, and
consider the model

$$E(y_{gm} \mid x_g, Z_g, c_g) = \Phi(\alpha + x_g\beta + z_{gm}\gamma + c_g).$$

a. Assume that $c_g = \eta_g + \bar z_g\xi_g + a_g$. Find $E(y_{gm} \mid x_g, Z_g, a_g)$.

b. Add the assumption $a_g \mid x_g, Z_g \sim \mathrm{Normal}(0, \tau_g^2)$ and find $E(y_{gm} \mid x_g, Z_g)$. (Hint: It
should have the probit form.)

c. Suppose that $\eta_g$, $\xi_g$, and $\tau_g^2$ depend only on the group size, $M_g$. Suggest a method
for estimating all the parameters.


d. How would you perform inference on the parameters estimated in part c? 


21 Estimating Average Treatment Effects


21.1 Introduction 


We now explicitly cover the problem of estimating an average treatment effect (ATE), 
sometimes called an average causal effect. An ATE is a special case of an average 
partial effect—it is an APE for a binary explanatory variable—and therefore many 
of the econometric models and methods that we have used in previous chapters can 
be applied or adapted to the problem of estimating ATEs. 

Estimating ATEs has become important in the program evaluation literature, such 
as the evaluation of job-training programs or school voucher programs. Many of the 
early applications of the methods described in this chapter were to medical inter- 
ventions, and some of the language (such as “treatment group” and “control group”) 
is a carryover from those early applications. But the methods have proven to be 
useful in situations where experiments are clearly impractical. 

The organizing principle of a modern approach to program evaluation is the 
counterfactual framework pioneered by Rubin (1974)—in fact, the framework has 
been dubbed the Rubin causal model (RCM)—and since adopted by many authors 
in statistics, econometrics, and many other fields, including Rosenbaum and Rubin 
(1983), Heckman (1992, 1997), Imbens and Angrist (1994), Angrist, Imbens, and 
Rubin (1996), Manski (1996), Heckman, Ichimura, and Todd (1997), and Angrist 
(1998). Research on estimating treatment or causal effects using Rubin’s framework 
continues unabated. This chapter is intended to provide an introduction and a fairly 
detailed treatment of the most commonly used methods. Recent surveys include 
Heckman, Lalonde, and Smith (2000), Imbens (2004), Heckman and Vytlacil (2007a, 
2007b), and Imbens and Wooldridge (2009), to name a handful. 

Counterfactual thinking is not reserved for estimating average treatment effects 
using the RCM framework. Recall that in Chapter 9 we discussed how sensible 
applications of simultaneous equations models entail being able to think about each 
equation in isolation from other equations in the system. (We called this the auton- 
omy requirement.) For example, a demand function is defined for each possible price, 
even though when we collect data the prices we observe are (usually assumed to be) 
equilibrium prices determined by the intersection of supply and demand. Therefore, 
the reasoning underlying the RCM should be familiar even if the particulars are not. 

This chapter mainly focuses on binary treatments, although Section 21.6.2 briefly 
describes approaches that are available when the treatment takes on more than two 
values. It is also possible to consider continuous “treatments,” which would make 
coverage of simultaneous equations models and explicit counterfactuals possible. 
This chapter does not attempt such a unification. Pearl (2000) provides a counter- 
factual setting that encompasses SEMs. 

Most approaches to estimating average treatment effects fall into one of three
categories. The first exploits ignorability or unconfoundedness of the treatment
conditional on a set of observed covariates. As we will see in Section 21.3, this approach
is analogous to the proxy variable solution to the omitted variables problem that we 
discussed in Chapter 4. In fact, one approach to estimating treatment effects is to use 
linear regression with many controls: in effect, the treatment is exogenous once we 
control for enough observed factors. As we will see, an important benefit of igno- 
rability of treatment is that no functional form or distributional assumptions are 
needed to identify the population parameters of interest (even though, as a practical 
matter, we may make parametric assumptions). 

A second approach allows selection into treatment to depend on unobserved (and 
observed) factors. Traditionally, we would say that the treatment is “endogenous.” 
In this case, we rely on the availability of instrumental variables (IVs) in order to 
identify and estimate average treatment effects. Sometimes standard IV estimators 
identify the effects of interest, but in other cases we rely on control function methods. 
Depending on the quantity we hope to estimate, we generally need to impose 
restrictions on functional forms or distributions or both. However, we will also dis- 
cuss the work by Imbens and Angrist (1994) that provides a useful interpretation of 
IV estimation under very weak assumptions. We discuss IV approaches in Section 
21.4. 

This chapter also provides an introduction to regression discontinuity designs, 
where treatment—or the probability of treatment—is a discontinuous function of 
an observed forcing variable. If underlying regression functions are assumed to be 
smooth in the forcing variable, the discontinuity of treatment can be used to identify 
a local treatment effect. Section 21.5 considers both sharp and fuzzy designs. 

There is much ongoing research on estimating average treatment effects. Some of 
these active areas are touched on in Section 21.6. For example, Sections 21.6.2 and 
21.6.3 consider multivalued treatments and multiple treatments, respectively. 

Section 21.6.4 gives a brief discussion of estimating treatment effects with panel 
data, showing how standard unobserved effects models can be obtained when one 
assumes unconfoundedness of the history of treatments conditional on time-constant 
unobserved heterogeneity (and observed covariates). An alternative approach, which 
assumes unconfoundedness conditional on the observed past history, is also covered. 


21.2 A Counterfactual Setting and the Self-Selection Problem 


The modern literature on treatment effects begins with a counterfactual, where 
each individual (or other agent) has an outcome with and without treatment (where 
“treatment” is interpreted very broadly). This section draws heavily on Heckman 
(1992, 1997), Imbens and Angrist (1994), and Angrist, Imbens, and Rubin (1996) 
(hereafter AIR). Let $y_1$ denote the outcome with treatment and $y_0$ the outcome 
without treatment. Because an individual cannot be in both states, we cannot observe 
both $y_0$ and $y_1$; in effect, the problem we face is one of missing data. In fact, we will 
show how to apply the inverse probability weighting methods from Section 19.8 to 
the problem of estimating average treatment effects. 

It is important to see that we have made no assumptions about the distributions of 
$y_0$ and $y_1$. In many cases these may be roughly continuously distributed (such as 
salary), but often $y_0$ and $y_1$ are binary outcomes (such as a welfare participation 
indicator), or even corner solution outcomes (such as married women’s labor supply). 
However, some of the assumptions we make will be less plausible for discontinuous 
random variables, something we discuss after introducing the assumptions. 

The following discussion assumes that we have an independent, identically distri- 
buted sample from the population. This assumption rules out cases where the treat- 
ment of one unit affects another’s outcome (possibly through general equilibrium 
effects, as in Heckman, Lochner, and Taber, 1998). The assumption that treatment 
of unit i affects only the outcome of unit i is called the stable unit treatment value 
assumption (SUTVA) in the treatment literature (see, for example, AIR). We are 
making a stronger assumption because random sampling implies SUTVA. 

Let the variable w be a binary treatment indicator, where w = 1 denotes treatment 
and w = 0 otherwise. The triple $(y_0, y_1, w)$ represents a random vector from the 
underlying population of interest. For a random draw i from the population, we write 
$(y_{i0}, y_{i1}, w_i)$. However, as we have throughout, we state assumptions in terms of the 
population. 

To measure the effect of treatment, we are interested in the difference in the outcomes 
with and without treatment, $y_1 - y_0$. Because this is a random variable (that 
is, it is individual specific), we must be clear about what feature of its distribution 
we want to estimate. Several possibilities have been suggested in the literature. In 
Rosenbaum and Rubin (1983), the quantity of interest is the average treatment effect 
(ATE), 


$$\tau_{ate} = E(y_1 - y_0), \tag{21.1}$$


which is the expected effect of treatment on a randomly drawn person from the 
population. Some have criticized this measure as not being especially relevant for 
policy purposes: because it averages across the entire population, it includes in the 
average units who would never be eligible for treatment. Heckman (1997) gives the 
example of a job training program, where we would not want to include millionaires 
in computing the average effect of a job training program. This criticism is somewhat 
misleading, as we can—and would—exclude people from the population who would 
never be eligible. For example, in evaluating a job training program, we might re- 
strict attention to people whose pretraining income is below a certain threshold; 
wealthy people would be excluded precisely because we have no interest in how job 
training affects the wealthy. In evaluating the benefits of a program such as Head 
Start, we could restrict the population to those who are actually eligible for the pro- 
gram or are likely to be eligible in the future. In evaluating the effectiveness of en- 
terprise zones, we could restrict our analysis to block groups whose unemployment 
rates are above a certain threshold or whose per capita incomes are below a certain 
level. 

A second quantity of interest, and one that has received much recent attention, is 
the average treatment effect on the treated, which we denote $\tau_{att}$: 

$$\tau_{att} = E(y_1 - y_0 \mid w = 1). \tag{21.2}$$

That is, $\tau_{att}$ is the mean effect for those who actually participated in the program. 
As we will see, in some special cases equations (21.1) and (21.2) are equivalent, but 
generally they differ. 

Imbens and Angrist (1994) define another treatment effect, which they call a local 
average treatment effect (LATE). LATE has the advantage of being estimable using 
instrumental variables under very weak conditions. It has two potential drawbacks: 
(1) it measures the effect of treatment on a generally unidentifiable subpopulation; 
and (2) the definition of LATE depends on the particular instrumental variable that 
we have available. We will discuss LATE in the simplest setting in Section 21.4.3. 

We can expand the definition of both treatment effects by conditioning on covariates. 
If x is an observed covariate, the ATE conditional on x is simply $E(y_1 - y_0 \mid x)$; 
similarly, equation (21.2) becomes $E(y_1 - y_0 \mid x, w = 1)$. By choosing x appropriately, 
we can define ATEs for various subsets of the population. For example, x 
can be pretraining income or a binary variable indicating poverty status, race, or 
gender. Recent work by Heckman and Vytlacil (2006) and Heckman, Urzua, and 
Vytlacil (2006) unifies the various kinds of average treatment effects by defining the 
marginal treatment effect. In this chapter, we focus on estimating $\tau_{ate}$, $\tau_{att}$, and these 
effects on various subpopulations. 

As noted previously, the difficulty in estimating equation (21.1) or (21.2) is that we 
observe only $y_0$ or $y_1$, not both, for each person. More precisely, along with w, the 
observed outcome is 

$$y = (1 - w)y_0 + w y_1 = y_0 + w(y_1 - y_0). \tag{21.3}$$

Therefore, the question is, How can we estimate $\tau_{ate}$ or $\tau_{att}$ with a random sample on 
y and w (and usually some observed covariates)? 

First, suppose that the treatment indicator w is statistically independent of $(y_0, y_1)$, 
as would occur when treatment is randomized across agents. One implication of 
independence between treatment status and the potential outcomes is that $\tau_{ate}$ and 
$\tau_{att}$ are identical: $E(y_1 - y_0 \mid w = 1) = E(y_1 - y_0)$. Furthermore, estimation of $\tau_{ate}$ is 
simple. Using equation (21.3), we have 


$$E(y \mid w = 1) = E(y_1 \mid w = 1) = E(y_1),$$

where the last equality follows because $y_1$ and w are independent. Similarly, 

$$E(y \mid w = 0) = E(y_0 \mid w = 0) = E(y_0).$$

It follows that 

$$\tau_{ate} = \tau_{att} = E(y \mid w = 1) - E(y \mid w = 0). \tag{21.4}$$


The right-hand side is easily estimated by a difference in sample means: the sample 
average of y for the treated units minus the sample average of y for the untreated 
units. Thus, randomized treatment guarantees that the difference-in-means estimator 
from basic statistics is unbiased, consistent, and asymptotically normal. In fact, these 
properties are preserved under the weaker assumption of mean independence: 
$E(y_0 \mid w) = E(y_0)$ and $E(y_1 \mid w) = E(y_1)$. 
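
The claim that randomization makes the difference in sample means a good estimate of the average treatment effect can be checked with a small simulation. The following is only a sketch: the data-generating process (normal potential outcomes, a true average effect of 2.0, assignment probability 0.5) and the function names `difference_in_means` and `simulate_randomized` are assumptions made for illustration, not part of the text.

```python
import random


def difference_in_means(y, w):
    """Sample average of y for treated units minus that for untreated units."""
    treated = [yi for yi, wi in zip(y, w) if wi == 1]
    control = [yi for yi, wi in zip(y, w) if wi == 0]
    return sum(treated) / len(treated) - sum(control) / len(control)


def simulate_randomized(n=20000, true_ate=2.0, seed=0):
    """Draw (y, w) where w is assigned independently of (y0, y1)."""
    rng = random.Random(seed)
    y, w = [], []
    for _ in range(n):
        y0 = rng.gauss(0.0, 1.0)
        # Heterogeneous gain y1 - y0 with mean true_ate (an assumed design).
        y1 = y0 + true_ate + rng.gauss(0.0, 1.0)
        wi = 1 if rng.random() < 0.5 else 0  # assignment ignores (y0, y1)
        y.append(wi * y1 + (1 - wi) * y0)    # observed outcome, as in (21.3)
        w.append(wi)
    return y, w


y, w = simulate_randomized()
estimate = difference_in_means(y, w)  # should be close to 2.0
```

With 20,000 draws the estimate should land near the assumed effect of 2.0; only the observed outcome and the treatment indicator enter the estimator.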

Randomization of treatment is often infeasible in program evaluation (although 
randomization of eligibility sometimes is feasible; more on this topic later). In most 
cases, individuals at least partly determine whether they receive treatment, and their 
decisions may be related to the benefits of or gain from treatment, $y_1 - y_0$. In other 
words, there is self-selection into treatment. 

It turns out that $\tau_{att}$ can be consistently estimated as a difference in means under 
the weaker assumption that w is independent of $y_0$, without placing any restriction on 
the relationship between w and $y_1$. To see this point, note that we can always write 

$$\begin{aligned}
E(y \mid w = 1) - E(y \mid w = 0) &= E(y_0 \mid w = 1) - E(y_0 \mid w = 0) + E(y_1 - y_0 \mid w = 1) \\
&= [E(y_0 \mid w = 1) - E(y_0 \mid w = 0)] + \tau_{att}.
\end{aligned} \tag{21.5}$$

If $y_0$ is mean independent of w, that is, 

$$E(y_0 \mid w) = E(y_0), \tag{21.6}$$


then the first term in equation (21.5) disappears, and so the difference-in-means 
estimator is an unbiased estimator of $\tau_{att}$. Unfortunately, condition (21.6) is still a pretty 
strong assumption. For example, suppose that people are randomly made eligible for 
a voluntary job training program. Condition (21.6) effectively implies that the participation 
decision is unrelated to what people would earn in the absence of the program. 
Nevertheless, the standard difference-in-means estimator is consistent for $\tau_{att}$ in 
some scenarios where it is inconsistent for $\tau_{ate}$. 
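
The decomposition in equation (21.5) can be verified on a toy population in which both potential outcomes are visible. This is a sketch under stated assumptions: the six units below are invented, and expectations are taken as simple averages over this hypothetical finite population. Units with higher untreated outcomes select into treatment, so the first term in (21.5) is not zero.

```python
# Hypothetical population: each unit is (y0, y1, w). In real data only one of
# y0, y1 is observed; both are listed here to make the decomposition visible.
population = [
    (10.0, 12.0, 1),
    (8.0, 11.0, 1),
    (7.0, 9.0, 1),
    (6.0, 6.5, 0),
    (5.0, 5.0, 0),
    (4.0, 4.5, 0),
]


def mean(values):
    values = list(values)
    return sum(values) / len(values)


def decompose(pop):
    """Return (difference in means, selection term, ATT, ATE) for the population."""
    treated = [u for u in pop if u[2] == 1]
    control = [u for u in pop if u[2] == 0]
    # Observed outcome is y1 for treated units and y0 for control units.
    diff_in_means = mean(y1 for _, y1, _ in treated) - mean(y0 for y0, _, _ in control)
    # First term of (21.5): E(y0 | w = 1) - E(y0 | w = 0).
    selection = mean(y0 for y0, _, _ in treated) - mean(y0 for y0, _, _ in control)
    tau_att = mean(y1 - y0 for y0, y1, _ in treated)
    tau_ate = mean(y1 - y0 for y0, y1, _ in pop)
    return diff_in_means, selection, tau_att, tau_ate


diff_in_means, selection, tau_att, tau_ate = decompose(population)
```

Here the difference in means equals the selection term plus the ATT, and it overstates both the ATT and the ATE because condition (21.6) fails in this constructed population.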

A useful expression relating $\tau_{att}$ and $\tau_{ate}$ is obtained by writing $y_0 = \mu_0 + v_0$ and 
$y_1 = \mu_1 + v_1$, where $\mu_g = E(y_g)$, g = 0, 1. Then 

$$y_1 - y_0 = (\mu_1 - \mu_0) + (v_1 - v_0) = \tau_{ate} + (v_1 - v_0).$$

Taking the expectation of this equation conditional on w = 1 gives 

$$\tau_{att} = \tau_{ate} + E(v_1 - v_0 \mid w = 1).$$

We can think of $v_1 - v_0$ as the person-specific gain from participation (that is, the 
deviation from the population mean), and so $\tau_{att}$ differs from $\tau_{ate}$ by the expected 
person-specific gain for those who participated. If $y_1 - y_0$ is not mean independent of 
w, $\tau_{att}$ and $\tau_{ate}$ generally differ. 

Fortunately, we can estimate $\tau_{ate}$ and $\tau_{att}$ under assumptions less restrictive than 
independence between $(y_0, y_1)$ and w. In most cases, we can collect data on individual 
characteristics and relevant pretreatment outcomes (sometimes a substantial 
amount of data). If, in an appropriate sense, treatment depends on the observables 
and not on the unobservables determining $(y_0, y_1)$, then we can estimate average 
treatment effects quite generally, as we show in the next section. 


21.3 Methods Assuming Ignorability (or Unconfoundedness) of Treatment 


We adopt the framework of the previous section, and, in addition, we let x denote a 
vector of observed covariates. Therefore, the population is described by $(y_0, y_1, w, x)$, 
and we observe y, w, and x, where y is given by equation (21.3). When w and $(y_0, y_1)$ 
are allowed to be correlated, we need an assumption in order to identify treatment 
effects. Rosenbaum and Rubin (1983) introduced the following assumption, which 
they called ignorability of treatment (given observed covariates x): 


ASSUMPTION ATE.1 (Ignorability): Conditional on x, w and $(y_0, y_1)$ are independent. 


Assumption ATE.1 has also been called unconfoundedness or simply conditional 
independence. For many purposes, it suffices to assume ignorability in a conditional 
mean independence sense: 

ASSUMPTION ATE.1’ (Ignorability in Mean): (a) $E(y_0 \mid x, w) = E(y_0 \mid x)$; and (b) 
$E(y_1 \mid x, w) = E(y_1 \mid x)$. 


Naturally, Assumption ATE.1 implies Assumption ATE.1’. In practice, Assumption 
ATE.1’ might not afford much generality, although it does allow $\mathrm{Var}(y_0 \mid x, w)$ and 
$\mathrm{Var}(y_1 \mid x, w)$ to depend on w. The idea underlying Assumption ATE.1’ is this: if 
we can observe enough information (contained in x) that determines treatment, then 
$(y_0, y_1)$ might be mean independent of w, conditional on x. Loosely, even though 
$(y_0, y_1)$ and w might be correlated, they are uncorrelated once we partial out x. 

Assumption ATE.1 certainly holds if w is a deterministic function of x, which has 
prompted some authors in econometrics to call assumptions like ATE.1 selection on 
observables; see, for example, Barnow, Cain, and Goldberger (1980), Goldberger 
(1981), Heckman and Robb (1985), and Moffitt (1996). (We discussed a similar as- 
sumption in Section 19.9.3 in the context of missing data and attrition.) The name is 
fine as a label, but we must realize that Assumption ATE.1 does allow w to depend 
on unobservables, albeit in a restricted fashion. If w = g(x, a), where a is an unobservable 
random variable independent of $(x, y_0, y_1)$, then Assumption ATE.1 holds. 
But a cannot be arbitrarily correlated with $y_0$ and $y_1$. 

To proceed, it is helpful to define the two counterfactual conditional means, 


$$\mu_0(x) = E(y_0 \mid x), \qquad \mu_1(x) = E(y_1 \mid x). \tag{21.7}$$

In general, these functions are unknown. But ignorability (in particular, Assumption 
ATE.1’), along with another assumption that we will discuss shortly, is sufficient to 
identify $\mu_g(\cdot)$, g = 0, 1. We will show this point in the next section. First, it is important 
to know that, under Assumption ATE.1’, the average treatment effect conditional 
on x and the average treatment effect on the treated conditional on x are identical. 
More precisely, define 

$$\tau_{ate}(x) = E(y_1 - y_0 \mid x) = \mu_1(x) - \mu_0(x) \tag{21.8}$$

$$\tau_{att}(x) = E(y_1 - y_0 \mid x, w = 1). \tag{21.9}$$

Then, by Assumption ATE.1’, $E(y_g \mid x, w) = E(y_g \mid x)$, g = 0, 1, and so $\tau_{ate}(x) = \tau_{att}(x)$. 
In general, though (even in cases where $\tau_{att}(x)$ is identified), $\tau_{ate}(x)$ and 
$\tau_{att}(x)$ can be different. 

Intuitively, the ignorability assumption seems to have a better chance of holding 
when the set of control variables, x, is richer. But one must be careful not to include 
variables in x that can themselves be affected by treatment. For example, suppose w 
is a job training indicator, and y is future labor earnings. We would not want to 
include in x a measure of, say, education obtained between the time of assignment 
and the time labor earnings are measured. By doing so, we in effect hold additional 
education fixed when it varies in reaction to assignment w. Generally, including factors 
in x that are affected by w causes ignorability to fail. 

Mathematically, Wooldridge (2005d) has shown that ignorability fails when x is 
influenced by w in the following setting. Suppose w is actually randomized with 
respect to $(y_0, y_1)$, in which case the standard difference-of-means estimator is 
unbiased and consistent for $\tau_{ate}$. But at least one x is related to assignment in the 
sense that $D(x \mid w = 1) \neq D(x \mid w = 0)$. Using iterated expectations, it can be shown 
that $E(y_g \mid x, w) = E(y_g \mid x)$, g = 0, 1 if and only if $E(y_g \mid x) = E(y_g)$, g = 0, 1. That 
is, ignorability holds only if the covariates do not help to predict the counterfactual 
outcomes. If x includes education attained between assignment and measurement of 
y, $E(y_g \mid x)$ almost certainly depends on x, and so unconfoundedness fails. 

Good candidates for inclusion in x are variables measured prior to treatment 
assignment, including past outcomes on y. As shown in Wooldridge (2009c), variables 
that satisfy instrumental variables assumptions (they are independent of unobservables 
that affect $(y_0, y_1)$ but help predict w) should be excluded because their inclusion 
increases bias in standard regression adjustment estimators unless ignorability 
holds without the instrument-like variables. 

Unfortunately, ignorability is fundamentally untestable because we only observe 
(y,w,x). In some cases, it can be tested indirectly—see, for example, the discussion 
in Imbens and Wooldridge (2009). An alternative is to perform a sensitivity analysis 
similar to studying omitted variables. Imbens and Wooldridge (2009) survey some 
possibilities. 

Assuming that ignorability holds, what is the additional assumption we need to 
identify the unconditional average treatment effect, $\tau_{ate}$? From the law of iterated 
expectations, 

$$\tau_{ate} = E[\tau_{ate}(x)] = E[\mu_1(x) - \mu_0(x)], \tag{21.10}$$

where the expectations are over the distribution of x. As we will see in the next couple 
of subsections, estimating $\tau_{ate}$ will require being able to observe both control and 
treated units for every outcome on x (a weaker assumption, to be stated precisely, 
suffices for $\tau_{att}$). This assumption is typically called the overlap assumption. 


ASSUMPTION ATE.2 (Overlap): For all $x \in \mathcal{X}$, where $\mathcal{X}$ is the support of the 
covariates, 

$$0 < P(w = 1 \mid x) < 1. \tag{21.11}$$


Overlap means that, for any setting of the covariates in the assumed population, 
there is a chance of seeing units in both the control and treatment groups. If, for 
example, $P(w = 1 \mid x = x_0) = 0$, then units having covariate values $x_0$ will never be in 
the treated group. Generally, we will not be able to estimate an average treatment 
effect over the population that includes units with $x = x_0$. 

The probability of treatment, as a function of x, plays a very important role in 
estimating average treatment effects. It is usually called the propensity score, and we 
denote it 


$$p(x) = P(w = 1 \mid x), \quad x \in \mathcal{X}. \tag{21.12}$$


The overlap assumption rules out the possibility that the propensity score is ever zero 
or one. 
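
With discrete covariates, a first look at overlap is to compute the treated share in each covariate cell and flag cells where it is exactly zero or one. The sketch below assumes a single categorical covariate; the data and the function names `propensity_by_cell` and `overlap_violations` are hypothetical. A sample share of zero or one does not prove that overlap fails in the population, but it does mean the sample contains no comparison units for that cell.

```python
from collections import defaultdict


def propensity_by_cell(x, w):
    """Estimate p(x) = P(w = 1 | x) by the treated share within each cell of x."""
    counts = defaultdict(lambda: [0, 0])  # cell -> [number treated, cell size]
    for xi, wi in zip(x, w):
        counts[xi][0] += wi
        counts[xi][1] += 1
    return {cell: treated / size for cell, (treated, size) in counts.items()}


def overlap_violations(pscores):
    """Cells whose estimated propensity score is exactly 0 or 1."""
    return {cell: p for cell, p in pscores.items() if p <= 0.0 or p >= 1.0}


# Hypothetical sample: pretraining income bracket and treatment indicator.
x = ["low", "low", "mid", "mid", "mid", "mid", "high", "high"]
w = [1, 1, 1, 0, 0, 1, 0, 0]

pscores = propensity_by_cell(x, w)
flagged = overlap_violations(pscores)
```

In this invented sample the "low" bracket is always treated and the "high" bracket never is, so only the "mid" bracket has both treated and control units.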

Rosenbaum and Rubin (1983) call ignorability plus overlap strong ignorability, an 
assumption that is critical in all approaches to estimating $\tau_{ate}$. For completeness, we 
state weaker versions of the ignorability and overlap assumptions that suffice for 
identifying $\tau_{att}$: 


ASSUMPTION ATT.1’ (Ignorability in Mean): $E(y_0 \mid x, w) = E(y_0 \mid x)$. 

ASSUMPTION ATT.2 (Overlap): For all $x \in \mathcal{X}$, $P(w = 1 \mid x) < 1$. 


21.3.1 Identification 


Given the previous ignorability and overlap assumptions, we can establish quite 
generally that $\tau_{ate}$ (and, under the weaker assumptions, $\tau_{att}$) are identified. We do so in 
two ways, each of which motivates subsequent estimation methods. 

Our first approach is based directly on the conditional mean $E(y \mid x, w)$. Recall that 
we can write $y = y_0 + w(y_1 - y_0)$. Under the mean version of ignorability, 

$$\begin{aligned}
E(y \mid x, w) &= E(y_0 \mid x, w) + w[E(y_1 \mid x, w) - E(y_0 \mid x, w)] \\
&= E(y_0 \mid x) + w[E(y_1 \mid x) - E(y_0 \mid x)] \\
&= \mu_0(x) + w[\mu_1(x) - \mu_0(x)],
\end{aligned} \tag{21.13}$$

where getting to equation (21.13) uses Assumption ATE.1’. We have shown 

$$E(y \mid x, w = 0) = \mu_0(x), \qquad E(y \mid x, w = 1) = \mu_1(x). \tag{21.14}$$


Because we observe (y, x, w), we can, under the overlap assumption, estimate 
$m_0(x) = E(y \mid x, w = 0)$ and $m_1(x) = E(y \mid x, w = 1)$ quite generally, and this claim 
has nothing to do with whether ignorability holds. If Assumption ATE.1’ holds, 
then 

$$\tau_{ate}(x) = m_1(x) - m_0(x). \tag{21.15}$$

In other words, when we add ignorability, the functions $m_g(\cdot)$, g = 0, 1 that we can estimate 
correspond to the quantities $\mu_g(\cdot)$ that we need to estimate in order to estimate 
$\tau_{ate}(x)$. If we identify $m_g(\cdot)$ for all $x \in \mathcal{X}$ (and this is where overlap comes in), then 
we can obtain $\tau_{ate}$ as 

$$\tau_{ate} = E[m_1(x) - m_0(x)], \tag{21.16}$$


where again the expected value is over the distribution of x. In practice, with a ran- 
dom sample, we use sample averaging. Details are given in the next subsection. 

If we are able to identify $\tau_{ate}(x)$ at all $x \in \mathcal{X}$, then we can also identify the average 
treatment effect on any subset of the population defined by x. For example, if we 
define 

$$\tau_{ate,\mathcal{R}} = E(y_1 - y_0 \mid x \in \mathcal{R}), \tag{21.17}$$

then we have 

$$\tau_{ate,\mathcal{R}} = E[\tau_{ate}(x) \mid x \in \mathcal{R}],$$

and so we can average $\tau_{ate}(x)$ over the subpopulation with $x \in \mathcal{R}$. 

To see that the weaker ignorability and overlap assumptions, Assumptions ATT.1’ 
and ATT.2, suffice for identifying $\tau_{att}$, note that we can use $y = y_0 + w(y_1 - y_0)$ to 
always write 

$$\begin{aligned}
E(y \mid x, w = 1) - E(y \mid x, w = 0) &= E(y_0 \mid x, w = 1) - E(y_0 \mid x, w = 0) + E(y_1 - y_0 \mid x, w = 1) \\
&= [E(y_0 \mid x, w = 1) - E(y_0 \mid x, w = 0)] + \tau_{att}(x).
\end{aligned} \tag{21.18}$$

If Assumption ATT.1’ holds, then the term in [·] in equation (21.18) is zero, and the 
difference in estimable means, $m_1(x) - m_0(x)$, actually identifies $\tau_{att}(x)$: 

$$\tau_{att}(x) = m_1(x) - m_0(x). \tag{21.19}$$

Assumption ATT.1’ allows the gain from treatment, $y_1 - y_0$, to be arbitrarily correlated 
with treatment, even after conditioning on x. It requires ignorability (in 
mean) only with respect to $y_0$, the outcome in the absence of treatment. Now $\tau_{att}$ is 
obtained as 

$$\tau_{att} = E[m_1(x) - m_0(x) \mid w = 1]. \tag{21.20}$$


We can use this expression to see how the weaker overlap condition, Assumption 
ATT.2, suffices. By definition, $m_1(\cdot)$ is for the treated subpopulation, w = 1, and so 
we only need to estimate this regression function for values of x corresponding to 
units in the treated group. In other words, we do not need a positive probability of 
treatment for all x. But we still must estimate $m_0(\cdot)$ at values of x corresponding to 
the treated subpopulation. If there are values of x where treatment is certain, we 
cannot hope to generally estimate $E(y \mid x, w = 0)$ for such values because we will not 
observe control units with the same values of the covariates. The weaker overlap 
assumption, $P(w = 1 \mid x) < 1$, rules out this possibility. 

We summarize the previous discussion with a proposition. 


PROPOSITION 21.1: Under Assumption ATE.1’, equation (21.15) holds. If we add 
the overlap Assumption ATE.2, we can obtain $\tau_{ate}$ as in equation (21.16). If we 
assume only Assumption ATT.1’, we have $\tau_{att}(x) = m_1(x) - m_0(x)$; under Assumption 
ATT.2, $\tau_{att}$ is identified as in equation (21.20). 


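A second way to establish identification is to use inverse propensity score weighting. 
First maintain Assumption ATE.1’. Noting that $wy = wy_1$, we have, by iterated 
expectations, 

$$\begin{aligned}
E\left[\frac{wy}{p(x)} \,\Big|\, x\right] &= E\left[\frac{wy_1}{p(x)} \,\Big|\, x\right] = E\left\{E\left[\frac{wy_1}{p(x)} \,\Big|\, x, w\right] \Big|\, x\right\} = E\left\{\frac{wE(y_1 \mid x, w)}{p(x)} \,\Big|\, x\right\} \\
&= E\left\{\frac{w\mu_1(x)}{p(x)} \,\Big|\, x\right\} = \frac{E(w \mid x)\mu_1(x)}{p(x)} = \mu_1(x)
\end{aligned}$$

because $E(w \mid x) = p(x)$. A similar argument shows that 

$$E\left[\frac{(1 - w)y}{1 - p(x)} \,\Big|\, x\right] = \mu_0(x).$$

Combining these two results and using simple algebra gives 

$$E\left\{\frac{[w - p(x)]y}{p(x)[1 - p(x)]} \,\Big|\, x\right\} = \mu_1(x) - \mu_0(x) = \tau_{ate}(x). \tag{21.21}$$

Of course, this expression only makes sense for x such that 0 < p(x) < 1. If we 
maintain Assumption ATE.2, and assume that the expectation exists, we can use 
iterated expectations to write 

$$\tau_{ate} = E\left\{\frac{[w - p(x)]y}{p(x)[1 - p(x)]}\right\}, \tag{21.22}$$

which, because w, y, and x are all observed, establishes identification of $\tau_{ate}$ using the 
propensity score rather than regression functions. 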




The argument for $\tau_{att}$ is a little different. Write 

$$\begin{aligned}
[w - p(x)]y &= [w - p(x)][y_0 + w(y_1 - y_0)] \\
&= [w - p(x)]y_0 + w[w - p(x)](y_1 - y_0) \\
&= [w - p(x)]y_0 + w[1 - p(x)](y_1 - y_0),
\end{aligned}$$

where the last equality follows because $w^2 = w$. Therefore, 

$$\frac{[w - p(x)]y}{[1 - p(x)]} = \frac{[w - p(x)]y_0}{[1 - p(x)]} + w(y_1 - y_0). \tag{21.23}$$

Consider the numerator of the first term on the right-hand side of equation (21.23): 

$$\begin{aligned}
E\{[w - p(x)]y_0 \mid x\} &= E(E\{[w - p(x)]y_0 \mid x, w\} \mid x) = E\{[w - p(x)]E(y_0 \mid x, w) \mid x\} \\
&= E\{[w - p(x)]E(y_0 \mid x) \mid x\} = E\{[w - p(x)] \mid x\}\mu_0(x) = 0.
\end{aligned}$$

Therefore, 

$$E\left\{\frac{[w - p(x)]y}{[1 - p(x)]} \,\Big|\, x\right\} = E[w(y_1 - y_0) \mid x] \tag{21.24}$$

and so the unconditional expectations are the same, too. But 

$$\begin{aligned}
E[w(y_1 - y_0)] &= P(w = 0)E[w(y_1 - y_0) \mid w = 0] + P(w = 1)E[w(y_1 - y_0) \mid w = 1] \\
&= 0 + P(w = 1)E[w(y_1 - y_0) \mid w = 1] \\
&= \rho\tau_{att},
\end{aligned} \tag{21.25}$$

where $\rho = P(w = 1)$ is the unconditional probability of treatment. Putting the pieces 
together gives 

$$\tau_{att} = E\left\{\frac{[w - p(x)]y}{\rho[1 - p(x)]}\right\}; \tag{21.26}$$

notice that this expression only requires the weaker assumption p(x) < 1 for all x. 


We summarize with a proposition. 


PROPOSITION 21.2: Under Assumptions ATE.1’ and ATE.2, $\tau_{ate}$ can be expressed as 
in equation (21.22). Under the weaker Assumptions ATT.1’ and ATT.2, 
$\tau_{att}$ can be expressed as in equation (21.26). 
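
Replacing the expectations in equations (21.22) and (21.26) with sample averages suggests simple weighting estimators. The sketch below is illustrative only: the eight observations are invented, the covariate is a single binary indicator, and the propensity score is estimated by the treated share in each cell, which is an assumption of this example and not a method taken from the text. The helper names `cell_propensity`, `ipw_ate`, and `ipw_att` are hypothetical.

```python
from collections import Counter


def cell_propensity(x, w):
    """Treated share within each cell of a discrete covariate."""
    size = Counter(x)
    treated = Counter(xi for xi, wi in zip(x, w) if wi == 1)
    return {cell: treated[cell] / size[cell] for cell in size}


def ipw_ate(y, w, p):
    """Sample analogue of (21.22): average of [w - p(x)] y / {p(x)[1 - p(x)]}."""
    n = len(y)
    return sum((wi - pi) * yi / (pi * (1.0 - pi)) for yi, wi, pi in zip(y, w, p)) / n


def ipw_att(y, w, p):
    """Sample analogue of (21.26): average of [w - p(x)] y / {rho [1 - p(x)]}."""
    n = len(y)
    rho = sum(w) / n  # sample share of treated units
    return sum((wi - pi) * yi / (rho * (1.0 - pi)) for yi, wi, pi in zip(y, w, p)) / n


# Hypothetical sample with one binary covariate.
x = [0, 0, 0, 0, 1, 1, 1, 1]
w = [1, 1, 0, 0, 1, 0, 0, 0]
y = [5.0, 7.0, 2.0, 4.0, 10.0, 4.0, 6.0, 8.0]

pscore = cell_propensity(x, w)
p = [pscore[xi] for xi in x]
tau_ate_hat = ipw_ate(y, w, p)   # 3.5 for these data
tau_att_hat = ipw_att(y, w, p)   # 10/3 for these data
```

With a saturated propensity score such as this one, the weighting estimates coincide with averaging the within-cell differences in treated and control means (3 in the x = 0 cell, 4 in the x = 1 cell), weighted by cell size for the ATE and by the number of treated units for the ATT.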


Wooldridge (1999c) derived expression (21.22) in the context of random coefficient 
models, while equation (21.26) is essentially due to Dehejia and Wahba (1999), who 
used the stronger Assumption ATE.1. As we will see, these identification results can 
be directly turned into estimating equations for $\tau_{ate}$ and $\tau_{att}$. We now turn to estimation 
in the next several subsections. 


21.3.2 Regression Adjustment 


The identification strategy based on the regression functions $E(y \mid x, w = 0)$ and 
$E(y \mid x, w = 1)$ leads directly to straightforward estimation approaches. Because 
we have a random sample on (y, w, x) from the relevant population, 
$m_1(x) = E(y \mid x, w = 1)$ and $m_0(x) = E(y \mid x, w = 0)$ are nonparametrically identified. That is, 
these are conditional expectations that depend entirely on observables, and so they 
can be consistently estimated quite generally. (See Hardle and Linton, 1994, or Li 
and Racine, 2007, for assumptions and methods.) For the purposes of identification, 
we can just assume $m_1(x)$ and $m_0(x)$ are known, and the fact that they are known 
means that $\tau_{ate}(x)$ is identified. If $\hat{m}_1(x)$ and $\hat{m}_0(x)$ are consistent estimators (in an 
appropriate sense), using the random sample of size N, a consistent estimator of $\tau_{ate}$ 
under fairly weak assumptions is 

$$\hat{\tau}_{ate,reg} = N^{-1} \sum_{i=1}^{N} [\hat{m}_1(x_i) - \hat{m}_0(x_i)], \tag{21.27}$$

while a consistent estimator of $\tau_{att}$ is 

$$\hat{\tau}_{att,reg} = \left(\sum_{i=1}^{N} w_i\right)^{-1} \left\{\sum_{i=1}^{N} w_i[\hat{m}_1(x_i) - \hat{m}_0(x_i)]\right\}. \tag{21.28}$$

The estimators in equations (21.27) and (21.28) are called regression adjustment 
estimators of $\tau_{ate}$ and $\tau_{att}$, respectively. Notice that $\hat{\tau}_{att,reg}$ simply averages the 
differences in predicted values, $\hat{m}_1(x_i) - \hat{m}_0(x_i)$, over the subsample of treated units, 
$w_i = 1$. 
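
When the covariates are discrete, the simplest nonparametric estimates of the two regression functions are cell means, and the estimators in (21.27) and (21.28) reduce to averages of within-cell differences. The following sketch assumes a single discrete covariate and uses invented data; the function names `cell_means`, `ra_ate`, and `ra_att` are hypothetical.

```python
from collections import defaultdict


def cell_means(y, x, w, group):
    """Mean of y within each cell of x, using only units with w == group."""
    total, count = defaultdict(float), defaultdict(int)
    for yi, xi, wi in zip(y, x, w):
        if wi == group:
            total[xi] += yi
            count[xi] += 1
    return {cell: total[cell] / count[cell] for cell in total}


def ra_ate(y, x, w):
    """Equation (21.27) with cell means as the estimated regression functions."""
    m0, m1 = cell_means(y, x, w, 0), cell_means(y, x, w, 1)
    # A KeyError here means some cell lacks treated or control units,
    # that is, overlap fails in the sample.
    return sum(m1[xi] - m0[xi] for xi in x) / len(x)


def ra_att(y, x, w):
    """Equation (21.28): average the differences over treated units only."""
    m0, m1 = cell_means(y, x, w, 0), cell_means(y, x, w, 1)
    # Only cells containing treated units are evaluated, so control units
    # are needed only in those cells (the weaker overlap requirement).
    terms = [m1[xi] - m0[xi] for xi, wi in zip(x, w) if wi == 1]
    return sum(terms) / len(terms)


# Hypothetical sample with one binary covariate.
x = [0, 0, 0, 0, 1, 1, 1, 1]
w = [1, 1, 0, 0, 1, 0, 0, 0]
y = [5.0, 7.0, 2.0, 4.0, 10.0, 4.0, 6.0, 8.0]

tau_ate_reg = ra_ate(y, x, w)   # (4 * 3 + 4 * 4) / 8 = 3.5
tau_att_reg = ra_att(y, x, w)   # (2 * 3 + 1 * 4) / 3 = 10/3
```

The ATT version never evaluates the treated-group mean in a cell that contains only control units, which mirrors the point that a weaker overlap condition is enough for the effect on the treated.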

The key implementation issue in computing $\hat{\tau}_{ate,reg}$ and $\hat{\tau}_{att,reg}$ is how to obtain 
$\hat{m}_0(\cdot)$ and $\hat{m}_1(\cdot)$. To be as flexible as possible, one could use nonparametric estimators, 
such as kernel estimators or series estimators. Kernel estimators use “local 
averaging” or “local smoothing” to estimate a function at a particular point. (See 
Li and Racine, 2007, for a comprehensive treatment of kernel regression with both 
continuous and discrete covariates.) In addition to being flexible, such local methods 
have the benefit of forcing us to confront problems with overlap in the covariate 
distribution. Consider equation (21.27). The function $\hat{m}_0(\cdot)$ is obtained using only 
those with $w_i = 0$, and $\hat{m}_1(\cdot)$ is obtained using only those with $w_i = 1$. Thus, in 
obtaining the summand $\hat{m}_1(x_i) - \hat{m}_0(x_i)$ for, say, someone in the control group, we 
must obtain $\hat{m}_1(x_i)$, where $\hat{m}_1(\cdot)$ was obtained only using those in the treatment 
group. If $x_i$ is very different from the covariate values of units in the treated sample, it 
is pretty hopeless to estimate $m_1(x)$ at $x = x_i$ using local averaging methods. Similarly, 
if i denotes a treated unit, we need to evaluate $\hat{m}_0(\cdot)$, which is obtained using 
control units, at covariate values for a treated unit. Generally, the overlap assumption 
means that in a large enough data set, we will see both control and treated units 
for a given set of covariate values (or, at least, for “close” covariate values). If we do 
not see these units, we cannot hope to estimate $\tau_{ate}$ for the original population. 

Equation (21.28) makes it clear that the weaker overlap assumption, Assumption 
ATT.2, suffices for obtaining a satisfactory estimate, $\hat{\tau}_{att,reg}$. Because $\hat{m}_1(\cdot)$ is obtained 
using treated units, and equation (21.28) sums only over treated units, we allow the 
possibility that some units in the control group will have covariates very different 
from the range of covariate values in the treated subset. But we still need to obtain 
$\hat{m}_0(x_i)$ when i is a treated unit. Therefore, for every treated unit, we hopefully have 
some units in the control group with similar values for the covariates. 

Series estimation involves global approximation using flexible parametric models, 
where the flexibility of the model increases with the sample size. For a recent treat- 
ment, see Li and Racine (2007). Hahn (1998) shows that nonparametric regression 
adjustment using series estimators can achieve the asymptotic efficiency bound for 
estimating $\tau_{ate}$. 

The problem with lack of overlap can be seen in a simpler setting. Suppose there 
is only one binary covariate, x, and Assumption ATE.1’ holds; for concreteness, x 
could be an indicator for whether pretraining earnings are below a certain threshold. 
Suppose that everyone in the relevant population with x = 1 participates in the 
program. Then, while we can estimate E(y|x = 1, w = 1) with a random sample from 
the population, we cannot estimate E(y|x = 1,w = 0) because we have no data on 
the subpopulation with x = 1 and w = 0. Intuitively, we only observe the counterfactual 
$y_1$ when x = 1; we never observe $y_0$ for any members of the population with 
x = 1. Therefore, $\tau_{ate}(x)$ is not identified at x = 1. 

If some people with x = 0 participate while others do not, we can estimate 
$E(y \mid x = 0, w = 1) - E(y \mid x = 0, w = 0)$ using a simple difference in averages over 
the group with x = 0, and so $\tau_{ate}(x)$ is identified at x = 0. But if we cannot estimate 
$\tau_{ate}(1)$, we cannot estimate the unconditional ATE because 
$\tau_{ate} = P(x = 0) \cdot \tau_{ate}(0) + P(x = 1) \cdot \tau_{ate}(1)$. In effect, we can only estimate the ATE over the 
subpopulation with x = 0, which means that we must redefine the population of interest. 
This limitation is unfortunate: presumably we would be very interested in the program’s 
effects on the group that always participates. 

A similar conclusion holds if the group with x = 0 never participates in the program. 
Then $\tau_{ate}(0)$ is not estimable because $E(y \mid x = 0, w = 1)$ is not estimable. If 
some people with x = 1 participated while others did not, $\tau_{ate}(1)$ would be identified, 
and then we would view the population of interest as the subgroup with x = 1. There 
is one important difference between this situation and the one where the x = 1 group 
always receives treatment: it may be legitimate to exclude from the population people 
who have no chance of treatment based on observed covariates. This observation is 
related to the issue we discussed in Section 21.2 concerning the relevant population 
for defining $\tau_{ate}$. If, for example, people with very high preprogram earnings (x = 0) 
have no chance of participating in a job training program, then we would not want to 
average together $\tau_{ate}(0)$ and $\tau_{ate}(1)$; $\tau_{ate}(1)$ by itself is much more interesting. 

Although the previous example is extreme, its consequences can arise in more 
plausible settings. Suppose that x is a vector of binary indicators for pretraining in- 
come intervals. For most of the intervals, the probability of participating is strictly 
between zero and one. If the participation probability is zero at the highest income 
level, we simply exclude the high-income group from the relevant population. (If 
we specify $\tau_{att}$ as the object of interest, we also will exclude high-income people.) 
Unfortunately, if participation is certain at low-income levels, we must exclude those 
low-income groups as well. 

One can compute simple statistics to judge whether overlap is a problem. As 
described in Imbens and Rubin (forthcoming), one can compute the normalized differences, 
which take the form 

$$\frac{\bar{x}_{1j} - \bar{x}_{0j}}{(s_{1j}^2 + s_{0j}^2)^{1/2}}, \tag{21.29}$$

where $\bar{x}_{gj}$ is the sample average of covariate j for group g = 0, 1 and $s_{gj}$ is the sample 
standard deviation. Imbens and Rubin suggest that normalized differences above 
0.25 are cause for concern. (Notice that the normalized difference is not the t statistic 
for testing the difference in means. It is the difference in means standardized by 
a measure of dispersion that is important; the sample size should not play a direct 
role.) Unfortunately, even if the normalized differences are all small, they only focus 
on one feature of the marginal distributions. Overlap can still fail in more complicated 
ways. In the next subsection, we will discuss how to use the estimated propensity 
score to evaluate overlap. If one or more normalized differences are large, one 
might need to redefine the population of interest. 
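
The normalized difference for a single covariate is easy to compute from the treated and control subsamples. The sketch below uses the usual sample variance with an n - 1 divisor, which is an assumption about the convention intended; the data and the function name `normalized_difference` are hypothetical.

```python
import math
import statistics


def normalized_difference(x_treated, x_control):
    """Difference in sample means divided by sqrt(s1^2 + s0^2), as in (21.29)."""
    mean_diff = statistics.mean(x_treated) - statistics.mean(x_control)
    # statistics.variance is the sample variance (n - 1 in the denominator).
    scale = math.sqrt(statistics.variance(x_treated) + statistics.variance(x_control))
    return mean_diff / scale


# Hypothetical covariate values for the two groups.
x_treated = [1.0, 2.0, 3.0]
x_control = [2.0, 4.0, 6.0]

nd = normalized_difference(x_treated, x_control)  # (2 - 4) / sqrt(1 + 4)
```

Unlike a t statistic, this quantity does not grow mechanically with the sample size, because the denominator is a measure of dispersion and not a standard error.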

It is useful to give a complete treatment of regression adjustment in the parametric 
case. (It is likely that the same analysis holds when the parametric model is allowed 
to become more flexible as the sample size grows, at least under standard restrictions.) 
Let $m_0(x, \delta_0)$ and $m_1(x, \delta_1)$ be parametric functions, where $m_0(x, \delta_0)$ is estimated 
using the $w_i = 0$ observations and $m_1(x, \delta_1)$ is estimated using the $w_i = 1$ 
observations. We have now covered several models and estimation methods that can 
be used. In the leading case, we can make the functions linear in parameters and use 
OLS, but we might want to exploit the nature of y; later on, we will discuss this topic 
further. 

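Given $\sqrt{N}$-consistent and asymptotically normal estimators $\hat{\boldsymbol{\delta}}_0$ and $\hat{\boldsymbol{\delta}}_1$, we have the
following parametric regression adjustment estimate of $\tau_{ate}$:

$$\hat{\tau}_{ate,reg} = N^{-1} \sum_{i=1}^{N} [m_1(\mathbf{x}_i, \hat{\boldsymbol{\delta}}_1) - m_0(\mathbf{x}_i, \hat{\boldsymbol{\delta}}_0)], \tag{21.30}$$

which will be $\sqrt{N}$-consistent and asymptotically normal for $\tau_{ate}$. Using Problem
12.17, it can be shown that

$$\begin{aligned}
\mathrm{Avar}\,\sqrt{N}(\hat{\tau}_{ate,reg} - \tau_{ate}) = {} & E\{[m_1(\mathbf{x}_i, \boldsymbol{\delta}_1) - m_0(\mathbf{x}_i, \boldsymbol{\delta}_0) - \tau_{ate}]^2\} \\
& + E[\nabla_{\delta_0} m_0(\mathbf{x}_i, \boldsymbol{\delta}_0)] \mathbf{V}_0 E[\nabla_{\delta_0} m_0(\mathbf{x}_i, \boldsymbol{\delta}_0)]' \\
& + E[\nabla_{\delta_1} m_1(\mathbf{x}_i, \boldsymbol{\delta}_1)] \mathbf{V}_1 E[\nabla_{\delta_1} m_1(\mathbf{x}_i, \boldsymbol{\delta}_1)]',
\end{aligned}$$

where $\mathbf{V}_0$ is the asymptotic variance of $\sqrt{N}(\hat{\boldsymbol{\delta}}_0 - \boldsymbol{\delta}_0)$ and similarly for $\mathbf{V}_1$. This formula
makes it clear that it is better to use more efficient estimators for $\boldsymbol{\delta}_0$ and $\boldsymbol{\delta}_1$. A
variance estimator is simple:

$$\begin{aligned}
N \cdot \widehat{\mathrm{Avar}}(\hat{\tau}_{ate,reg}) = {} & N^{-1} \sum_{i=1}^{N} [m_1(\mathbf{x}_i, \hat{\boldsymbol{\delta}}_1) - m_0(\mathbf{x}_i, \hat{\boldsymbol{\delta}}_0) - \hat{\tau}_{ate,reg}]^2 \\
& + \left[N^{-1} \sum_{i=1}^{N} \nabla_{\delta_0} m_0(\mathbf{x}_i, \hat{\boldsymbol{\delta}}_0)\right] \hat{\mathbf{V}}_0 \left[N^{-1} \sum_{i=1}^{N} \nabla_{\delta_0} m_0(\mathbf{x}_i, \hat{\boldsymbol{\delta}}_0)\right]' \\
& + \left[N^{-1} \sum_{i=1}^{N} \nabla_{\delta_1} m_1(\mathbf{x}_i, \hat{\boldsymbol{\delta}}_1)\right] \hat{\mathbf{V}}_1 \left[N^{-1} \sum_{i=1}^{N} \nabla_{\delta_1} m_1(\mathbf{x}_i, \hat{\boldsymbol{\delta}}_1)\right]'.
\end{aligned} \tag{21.31}$$

An alternative to an analytical expression is to use the bootstrap, which is straightforward
in this context. One simply includes the estimation of both mean functions,
the calculation of the differences $m_1(\mathbf{x}_i, \hat{\boldsymbol{\delta}}_1) - m_0(\mathbf{x}_i, \hat{\boldsymbol{\delta}}_0)$, and then the averaging of
these to obtain $\hat{\tau}_{ate,reg}$ in the same resampling scheme. All three sources of estimation
error (in $\hat{\boldsymbol{\delta}}_0$, in $\hat{\boldsymbol{\delta}}_1$, and in replacing the expected value over the distribution of $\mathbf{x}_i$
with the sample average) will be properly accounted for. In fact, as described in Li
and Racine (2007), the bootstrap is generally valid for nonparametric procedures,
too.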
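
The resampling scheme just described can be sketched as follows for the linear case. This is only an illustration under stated assumptions: the data are simulated, the helper names (`ols`, `ate_reg`, `bootstrap_se`) are hypothetical, and 200 replications is an arbitrary choice.

```python
import numpy as np


def ols(X, y):
    """OLS coefficients of y on the columns of X (X already contains a constant)."""
    return np.linalg.lstsq(X, y, rcond=None)[0]


def ate_reg(y, w, x):
    """Linear regression adjustment: fit each group separately, then
    average the difference in fitted values over the whole sample."""
    X = np.column_stack([np.ones(len(y)), x])
    d1 = ols(X[w == 1], y[w == 1])  # treated subsample
    d0 = ols(X[w == 0], y[w == 0])  # control subsample
    return float(np.mean(X @ d1 - X @ d0))


def bootstrap_se(y, w, x, reps=200, seed=0):
    """Nonparametric bootstrap: resample (y_i, w_i, x_i) jointly and redo every step."""
    rng = np.random.default_rng(seed)
    n = len(y)
    draws = []
    for _ in range(reps):
        idx = rng.integers(0, n, size=n)
        draws.append(ate_reg(y[idx], w[idx], x[idx]))
    return float(np.std(draws, ddof=1))


# Simulated data (an assumption for illustration): constant treatment effect of 2.
rng = np.random.default_rng(42)
n = 500
x = rng.normal(size=(n, 2))
w = (rng.uniform(size=n) < 0.5).astype(int)
y = 1.0 + x @ np.array([0.5, -0.3]) + 2.0 * w + rng.normal(size=n)

tau_hat = ate_reg(y, w, x)
se_boot = bootstrap_se(y, w, x)
```

Because both regressions and the final averaging sit inside `ate_reg`, each bootstrap draw repeats all three sources of estimation error.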

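Given the estimated regression functions, it is easy to estimate ATEs over a subset
of the population, say $\tau_{ate,\mathcal{R}} = E(y_1 - y_0 \mid \mathbf{x} \in \mathcal{R})$, as

$$\hat{\tau}_{ate,\mathcal{R},reg} = N_{\mathcal{R}}^{-1} \sum_{i=1}^{N} 1[\mathbf{x}_i \in \mathcal{R}] \cdot [m_1(\mathbf{x}_i, \hat{\boldsymbol{\delta}}_1) - m_0(\mathbf{x}_i, \hat{\boldsymbol{\delta}}_0)].$$

As mentioned earlier, linear regression is still the most popular method of regression
adjustment. Suppose that $m_0(\mathbf{x}, \boldsymbol{\delta}_0) = \alpha_0 + \mathbf{x}\boldsymbol{\beta}_0$ and $m_1(\mathbf{x}, \boldsymbol{\delta}_1) = \alpha_1 + \mathbf{x}\boldsymbol{\beta}_1$. Then
$(\hat{\alpha}_0, \hat{\boldsymbol{\beta}}_0)$ are from the regression $y_i$ on 1, $\mathbf{x}_i$ with $w_i = 0$ and similarly for $(\hat{\alpha}_1, \hat{\boldsymbol{\beta}}_1)$. Then

$$\hat{\tau}_{ate,reg}(\mathbf{x}) = (\hat{\alpha}_1 - \hat{\alpha}_0) + \mathbf{x}(\hat{\boldsymbol{\beta}}_1 - \hat{\boldsymbol{\beta}}_0) \tag{21.32}$$

$$\hat{\tau}_{ate,reg} = (\hat{\alpha}_1 - \hat{\alpha}_0) + \bar{\mathbf{x}}(\hat{\boldsymbol{\beta}}_1 - \hat{\boldsymbol{\beta}}_0) \tag{21.33}$$

$$\hat{\tau}_{att,reg} = (\hat{\alpha}_1 - \hat{\alpha}_0) + \bar{\mathbf{x}}_1(\hat{\boldsymbol{\beta}}_1 - \hat{\boldsymbol{\beta}}_0), \tag{21.34}$$

where $\bar{\mathbf{x}}$ is the sample average over the entire sample and $\bar{\mathbf{x}}_1 = N_1^{-1} \sum_{i=1}^{N} w_i \mathbf{x}_i$ is the
average over the treated subsample. The estimate of $\tau_{ate,\mathcal{R}}$ is simply $(\hat{\alpha}_1 - \hat{\alpha}_0) +
\bar{\mathbf{x}}_{\mathcal{R}}(\hat{\boldsymbol{\beta}}_1 - \hat{\boldsymbol{\beta}}_0)$, where $\bar{\mathbf{x}}_{\mathcal{R}}$ is the sample average over the restricted sample.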

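Replacing the vector $\mathbf{x}_i$ with any functions $\mathbf{h}(\mathbf{x}_i)$ is trivial. The only minor point
is that we should demean the regressors $\mathbf{h}_i = \mathbf{h}(\mathbf{x}_i)$ using $\bar{\mathbf{h}} = N^{-1} \sum_{i=1}^{N} \mathbf{h}_i$ (or over a
subset); it makes no sense to use $\mathbf{h}(\bar{\mathbf{x}})$.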

If we ignore the sampling variance in the sample averages, we can obtain a standard
error for $\hat{\tau}_{ate,\mathcal{R},reg}$ by using a pooled regression. Using the entire sample, run the
regression

$$y_i \text{ on } 1,\ w_i,\ \mathbf{x}_i,\ w_i(\mathbf{x}_i - \bar{\mathbf{x}}_{\mathcal{R}}); \tag{21.35}$$

the coefficient on $w_i$ is $\hat{\tau}_{ate,\mathcal{R},reg}$, and we can use the heteroskedasticity-robust standard
error to form a $t$ statistic or confidence intervals. Technically, we should adjust for the
estimation error in $\bar{\mathbf{x}}_{\mathcal{R}}$, but the adjustment may have a small effect (see Problem
6.10). Or, we can use the bootstrap to account for this uncertainty, too. Of course,
we get $\hat{\tau}_{ate,reg}$ by using $\bar{\mathbf{x}}_{\mathcal{R}} = \bar{\mathbf{x}}$. If we are mainly interested in the ATE over a subpopulation,
it may be better, by way of increasing overlap, to estimate the regression
functions on the restricted sample $\mathbf{x}_i \in \mathcal{R}$.
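
The equivalence between the pooled regression (21.35) and the separate-regressions formula (21.33) can be checked numerically. The sketch below uses simulated data and the full-sample average in place of the restricted-sample average; the variable names are hypothetical, and the HC0 form of the heteroskedasticity-robust variance is one choice among several.

```python
import numpy as np

# Simulated data (an assumption for illustration only).
rng = np.random.default_rng(0)
n = 400
x = rng.normal(size=(n, 2))
w = (rng.uniform(size=n) < 0.4).astype(float)
y = 1.0 + x @ np.array([1.0, -0.5]) + w * (2.0 + x[:, 0]) + rng.normal(size=n)

ones = np.ones(n)
xbar = x.mean(axis=0)  # sample average over the entire sample

# Separate regressions on the treated and control subsamples, then (21.33).
X = np.column_stack([ones, x])
d1 = np.linalg.lstsq(X[w == 1], y[w == 1], rcond=None)[0]
d0 = np.linalg.lstsq(X[w == 0], y[w == 0], rcond=None)[0]
tau_separate = (d1[0] - d0[0]) + xbar @ (d1[1:] - d0[1:])

# Pooled regression (21.35): y on 1, w, x, w * (x - xbar).
Z = np.column_stack([ones, w, x, w[:, None] * (x - xbar)])
b = np.linalg.lstsq(Z, y, rcond=None)[0]
tau_pooled = b[1]  # coefficient on w

# HC0 heteroskedasticity-robust standard error for the coefficient on w.
u = y - Z @ b
ZtZ_inv = np.linalg.inv(Z.T @ Z)
meat = (Z * u[:, None] ** 2).T @ Z
se_robust = float(np.sqrt((ZtZ_inv @ meat @ ZtZ_inv)[1, 1]))
```

The two point estimates agree up to floating-point error because the pooled regression is a reparameterization of the two separate regressions. The robust standard error treats the sample average of the covariates as fixed, as discussed above.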

If the range of $y$ is substantively restricted, we may wish to exploit that in estimation.
For example, if $y$ is a binary variable, or a fractional response, we can use a logit
or probit function and the Bernoulli quasi-MLE. If these functions are $G(\alpha_g + \mathbf{x}\boldsymbol{\beta}_g)$,
$g = 0, 1$ with $0 < G(\cdot) < 1$, let $(\hat{\alpha}_g, \hat{\boldsymbol{\beta}}_g)$ be the Bernoulli QMLEs on the two subsamples.
Then

$$\hat{\tau}_{ate,reg} = N^{-1} \sum_{i=1}^{N} [G(\hat{\alpha}_1 + \mathbf{x}_i\hat{\boldsymbol{\beta}}_1) - G(\hat{\alpha}_0 + \mathbf{x}_i\hat{\boldsymbol{\beta}}_0)] \tag{21.36}$$

is the estimate of $\tau_{ate}$. As in the general case, $G(\hat{\alpha}_1 + \mathbf{x}_i\hat{\boldsymbol{\beta}}_1) - G(\hat{\alpha}_0 + \mathbf{x}_i\hat{\boldsymbol{\beta}}_0)$ is the difference
in average response for unit $i$, given covariates $\mathbf{x}_i$, in the two treatment states
(regardless of unit $i$'s actual treatment status). If $y \geq 0$ (such as a count variable, but
not restricted to that case), an exponential regression function is sensible, in which
case

$$\hat{\tau}_{ate,reg} = N^{-1} \sum_{i=1}^{N} [\exp(\hat{\alpha}_1 + \mathbf{x}_i\hat{\boldsymbol{\beta}}_1) - \exp(\hat{\alpha}_0 + \mathbf{x}_i\hat{\boldsymbol{\beta}}_0)], \tag{21.37}$$

where the parameter estimates can be obtained from a quasi-MLE procedure, such as
Poisson or gamma regression, or nonlinear least squares.

Equations (21.33), (21.36), and (21.37) make it clear that when parametric models 
are used for the means, it is possible to estimate $\tau_{ate}$ while essentially ignoring the
overlap assumption. When using parametric models we are at least implicitly
assuming that, say, $m_1(\cdot, \boldsymbol{\delta}_1)$ holds for all $\mathbf{x} \in \mathcal{X}$, even though we only use the treated
subsample to obtain $\hat{\boldsymbol{\delta}}_1$. But as in any regression context, we should be careful in
extrapolating the estimated mean functions to values of $\mathbf{x}$ far from those used to
obtain $\hat{\boldsymbol{\delta}}_1$. The resulting estimate of $\tau_{ate}$ can be sensitive to the exact specification. For
example, in the linear case it can be shown (for example, Imbens and Wooldridge,
2009) that

$$\hat{\tau}_{ate,reg} = (\bar{y}_1 - \bar{y}_0) - (\bar{\mathbf{x}}_1 - \bar{\mathbf{x}}_0)(f_0\hat{\boldsymbol{\beta}}_1 + f_1\hat{\boldsymbol{\beta}}_0), \tag{21.38}$$

where $f_0 = N_0/(N_0 + N_1)$ is the fraction of control observations and $f_1$ is the fraction
of treated observations. If the difference in means $\bar{\mathbf{x}}_1 - \bar{\mathbf{x}}_0$ is large, changes in the
slope estimates can have a large effect on $\hat{\tau}_{ate,reg}$. Therefore, as a general rule, one
should not rely on parametric specifications as a way of overcoming poor overlap in 
the covariate distributions. 
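
For readers who want to see where equation (21.38) comes from, here is a short derivation sketch. It relies only on two standard OLS facts, stated here as the working assumptions: each subsample regression includes an intercept, so $\hat{\alpha}_g = \bar{y}_g - \bar{\mathbf{x}}_g\hat{\boldsymbol{\beta}}_g$, and the full-sample average is the weighted average $\bar{\mathbf{x}} = f_0\bar{\mathbf{x}}_0 + f_1\bar{\mathbf{x}}_1$ with $f_0 + f_1 = 1$.

```latex
\begin{aligned}
\hat{\tau}_{ate,reg}
  &= (\hat{\alpha}_1 - \hat{\alpha}_0) + \bar{\mathbf{x}}(\hat{\boldsymbol{\beta}}_1 - \hat{\boldsymbol{\beta}}_0) \\
  &= (\bar{y}_1 - \bar{y}_0) - \bar{\mathbf{x}}_1\hat{\boldsymbol{\beta}}_1 + \bar{\mathbf{x}}_0\hat{\boldsymbol{\beta}}_0
     + (f_0\bar{\mathbf{x}}_0 + f_1\bar{\mathbf{x}}_1)(\hat{\boldsymbol{\beta}}_1 - \hat{\boldsymbol{\beta}}_0) \\
  &= (\bar{y}_1 - \bar{y}_0) - (1 - f_1)\bar{\mathbf{x}}_1\hat{\boldsymbol{\beta}}_1 + f_0\bar{\mathbf{x}}_0\hat{\boldsymbol{\beta}}_1
     + (1 - f_0)\bar{\mathbf{x}}_0\hat{\boldsymbol{\beta}}_0 - f_1\bar{\mathbf{x}}_1\hat{\boldsymbol{\beta}}_0 \\
  &= (\bar{y}_1 - \bar{y}_0) - (\bar{\mathbf{x}}_1 - \bar{\mathbf{x}}_0)(f_0\hat{\boldsymbol{\beta}}_1 + f_1\hat{\boldsymbol{\beta}}_0).
\end{aligned}
```

The last line uses $1 - f_1 = f_0$ and $1 - f_0 = f_1$. The adjustment to the raw difference in means is proportional to $\bar{\mathbf{x}}_1 - \bar{\mathbf{x}}_0$, which is why poor covariate balance makes the estimate sensitive to the slope estimates.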


21.3.3 Propensity Score Methods 


The expressions derived in Section 21.3.1 to establish identification of $\tau_{ate}$ and
$\tau_{att}$ based on the propensity score lead directly to estimators. Practically, though, we
need to estimate the propensity score function, $p(\cdot)$. For the moment, let $\hat{p}(\mathbf{x})$ denote
such an estimator for any $\mathbf{x} \in \mathcal{X}$ obtained using the random sample $\{(w_i, \mathbf{x}_i) : i =
1, \ldots, N\}$. Then equation (21.22) suggests the estimator

$$\hat{\tau}_{ate,psw} = N^{-1} \sum_{i=1}^{N} \left[\frac{w_i y_i}{\hat{p}(\mathbf{x}_i)} - \frac{(1 - w_i) y_i}{1 - \hat{p}(\mathbf{x}_i)}\right] = N^{-1} \sum_{i=1}^{N} \frac{[w_i - \hat{p}(\mathbf{x}_i)] y_i}{\hat{p}(\mathbf{x}_i)[1 - \hat{p}(\mathbf{x}_i)]}, \tag{21.39}$$

while equation (21.26) suggests

$$\hat{\tau}_{att,psw} = N^{-1} \sum_{i=1}^{N} \frac{[w_i - \hat{p}(\mathbf{x}_i)] y_i}{\hat{\rho}[1 - \hat{p}(\mathbf{x}_i)]}, \tag{21.40}$$

where $\hat{\rho} = N_1/N$ is the fraction of treated units in the sample. As is evident from the
formulas, each estimate is simple to compute given the fitted probabilities of treatment,
$\hat{p}(\mathbf{x}_i)$. Interestingly, the estimator in equation (21.39) is the same as an estimator
due to Horvitz and Thompson (1952) for handling nonrandom sampling. Not
surprisingly, these estimators are generally consistent under Assumptions ATE.1′ and
ATE.2 or Assumptions ATT.1′ and ATT.2, respectively, and suitable regularity
conditions. If $\hat{p}(\cdot)$ is obtained via parametric methods, consistency generally follows
from Lemma 12.1.
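
Given fitted probabilities, equations (21.39) and (21.40) are one-line averages. The sketch below takes the propensity scores as an input rather than estimating them; the function names and the tiny data set are hypothetical.

```python
def ipw_ate(y, w, p):
    """Equation (21.39): average of (w_i - p_i) * y_i / (p_i * (1 - p_i))."""
    n = len(y)
    return sum((wi - pi) * yi / (pi * (1.0 - pi)) for yi, wi, pi in zip(y, w, p)) / n


def ipw_att(y, w, p):
    """Equation (21.40): average of (w_i - p_i) * y_i / (rho_hat * (1 - p_i))."""
    n = len(y)
    rho_hat = sum(w) / n  # fraction of treated units in the sample
    return sum((wi - pi) * yi / (rho_hat * (1.0 - pi)) for yi, wi, pi in zip(y, w, p)) / n


# Hypothetical data with a constant propensity score of 0.5.
y = [3.0, 1.0, 4.0, 2.0]
w = [1, 0, 1, 0]
p = [0.5, 0.5, 0.5, 0.5]

tau_ate = ipw_ate(y, w, p)
tau_att = ipw_att(y, w, p)
```

With a constant propensity score equal to the treated fraction, both estimators collapse to the difference in group means, here 3.5 - 1.5 = 2. Fitted probabilities close to zero or one make the denominators tiny, which is where overlap problems show up in practice.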

As for estimating the propensity score, Rosenbaum and Rubin (1983) suggest using 
a flexible logit model, where various functions of x—for example, levels, squares, and 
interactions—are included. For discrete components of x, one might define a set 
of dummy variables indicating the different possible values, and then interact these 
with continuous covariates and other dummy variables defined similarly. Clearly the 
sample size is important in deciding on how flexible the model can be. If we use a 
flexible logit or probit, or any index function $G(\cdot)$ with $0 < G(z) < 1$ for all $z \in \mathbb{R}$,
then there is no danger of $\hat{p}(\mathbf{x}) = 0$ or $\hat{p}(\mathbf{x}) = 1$, but ruling out zero and one from the
fitted probabilities might simply mask the failure of overlap in the population—just 
as when we use parametric versions of the regression functions and then extrapolate 
beyond values of x used to estimate the two functions. 

Hirano, Imbens, and Ridder (2003) (HIR for short) study a nonparametric version 
of the Rosenbaum and Rubin (1983) approach. In particular, they study $\hat{\tau}_{ate,psw}$ when
the propensity score is estimated using a flexible logit model, but they explicitly
allow the number of functions of the covariates in the logit function to increase as a
function of the sample size. Under regularity conditions and an assumption controlling
the number of terms in the logit estimation, HIR show that their nonparametric
version of the Horvitz-Thompson estimator achieves the semiparametric efficiency
bound due to Hahn (1998). Remarkably, the HIR estimator is more efficient (often
nontrivially so) than the “estimator” that uses the known propensity scores, $p(\mathbf{x}_i)$, in
place of the nonparametric fitted probabilities, $\hat{p}(\mathbf{x}_i)$. Moreover, the HIR estimator is
strictly more efficient (except in special cases) than the estimator that uses a maximum
likelihood estimator for $p(\cdot)$. This fact is related to a result by Robins and
Rotnitzky (1995): even if a logit model with a given number of terms is correctly
specified for $p(\cdot)$, one generally reduces the asymptotic variance of $\hat{\tau}_{ate,psw}$ by continuing
to add terms even though these have no effect on $P(w = 1 \mid \mathbf{x})$. The HIR
estimator essentially takes the Robins and Rotnitzky (1995) result to the limit. 
(Shortly, we will see how the efficiency gains work in a parametric setting.) 

An alternative to global smoothing methods such as the HIR series estimator is to 
use local smoothing, such as kernel regression. Li, Racine, and Wooldridge (2009) 
have recently proposed using kernel smoothing that allows both continuous and dis- 
crete covariates. 

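It is informative to show how to obtain the asymptotic variances of
$\sqrt{N}(\hat{\tau}_{ate,psw} - \tau_{ate})$ and $\sqrt{N}(\hat{\tau}_{att,psw} - \tau_{att})$ in the parametric cases where we assume a
correctly specified model for $p(\mathbf{x})$ and use the Bernoulli MLE. For $\hat{\tau}_{ate,psw}$, we can
directly apply the “surprising” efficiency result in Section 13.10.2. The ignorability-of-treatment
assumption (conditional on $\mathbf{x}$, of course) is easily shown to imply the
key conditional independence assumption in equation (13.66) (with notation properly
adjusted). Because $\hat{\tau}_{ate,psw}$ is a sample average, we can work directly off of its “first-order
condition.” Let $p(\mathbf{x}, \boldsymbol{\gamma})$ be the correctly specified propensity score model, and
define the score from the first-stage propensity score estimation as

$$\mathbf{d}_i = \mathbf{d}(w_i, \mathbf{x}_i, \boldsymbol{\gamma}) = \frac{\nabla_\gamma p(\mathbf{x}_i, \boldsymbol{\gamma})'[w_i - p(\mathbf{x}_i, \boldsymbol{\gamma})]}{p(\mathbf{x}_i, \boldsymbol{\gamma})[1 - p(\mathbf{x}_i, \boldsymbol{\gamma})]},$$

where $\boldsymbol{\gamma}$ is used here to also denote the true population value. Further, define

$$k_i = \frac{[w_i - p(\mathbf{x}_i, \boldsymbol{\gamma})] y_i}{p(\mathbf{x}_i, \boldsymbol{\gamma})[1 - p(\mathbf{x}_i, \boldsymbol{\gamma})]}, \tag{21.41}$$

which are the summands in $\hat{\tau}_{ate,psw}$ with the true propensity score inserted. Then the
asymptotic variance of $\sqrt{N}(\hat{\tau}_{ate,psw} - \tau_{ate})$ is simply $\mathrm{Var}(e_i)$, where $e_i$ is the population
residual from the regression $k_i$ on $\mathbf{d}_i'$. We can easily estimate this variance by
using the sample version. Let

$$\hat{\mathbf{d}}_i = \mathbf{d}(w_i, \mathbf{x}_i, \hat{\boldsymbol{\gamma}}) = \frac{\nabla_\gamma p(\mathbf{x}_i, \hat{\boldsymbol{\gamma}})'[w_i - p(\mathbf{x}_i, \hat{\boldsymbol{\gamma}})]}{p(\mathbf{x}_i, \hat{\boldsymbol{\gamma}})[1 - p(\mathbf{x}_i, \hat{\boldsymbol{\gamma}})]} \tag{21.42}$$

be the estimated score, and let

$$\hat{k}_i = \frac{[w_i - p(\mathbf{x}_i, \hat{\boldsymbol{\gamma}})] y_i}{p(\mathbf{x}_i, \hat{\boldsymbol{\gamma}})[1 - p(\mathbf{x}_i, \hat{\boldsymbol{\gamma}})]}. \tag{21.43}$$

Then, obtain the OLS residuals, $\hat{e}_i$, from the regression

$$\hat{k}_i \text{ on } 1,\ \hat{\mathbf{d}}_i', \quad i = 1, \ldots, N, \tag{21.44}$$

where the inclusion of unity ensures that we remove the sample average of $\hat{k}_i$ (which
is $\hat{\tau}_{ate,psw}$) in estimating the variance. Given these residuals, the asymptotic standard
error of $\hat{\tau}_{ate,psw}$ is

$$\left[N^{-1} \sum_{i=1}^{N} \hat{e}_i^2\right]^{1/2} \Big/ \sqrt{N}. \tag{21.45}$$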

The adjustment is particularly simple in the case of logit estimation of the propensity
score because then

$$\hat{\mathbf{d}}_i' = \mathbf{h}_i(w_i - \hat{p}_i), \tag{21.46}$$

where $\mathbf{h}_i = \mathbf{h}(\mathbf{x}_i)$ is the $1 \times R$ vector of covariates (including unity) and $\hat{p}_i = \Lambda(\mathbf{h}_i\hat{\boldsymbol{\gamma}}) =
\exp(\mathbf{h}_i\hat{\boldsymbol{\gamma}})/[1 + \exp(\mathbf{h}_i\hat{\boldsymbol{\gamma}})]$. We can also see that, as we add elements to $\mathbf{h}_i$, the population
residuals (and the sample counterparts) from the regression $k_i$ on $\mathbf{h}_i(w_i - p_i)$ will
shrink provided the extra elements in $\mathbf{h}_i(w_i - p_i)$ are partially correlated with $k_i$. This
is generally the case even though the extra functions in $\mathbf{h}_i$ have zero population
coefficients in $P(w_i = 1 \mid \mathbf{x}_i)$. This was the result noted by Robins and Rotnitzky
(1995). 

If we ignore estimation of the propensity score, we effectively treat $\{\hat{k}_i : i =
1, \ldots, N\}$ as being drawn from a random sample, and we use its sample average to
estimate the population mean, $\tau_{ate}$. The naive standard error that we obtain is

$$\left[N^{-1} \sum_{i=1}^{N} (\hat{k}_i - \hat{\tau}_{ate,psw})^2\right]^{1/2} \Big/ \sqrt{N}, \tag{21.47}$$

and this is at least as large as expression (21.45), and sometimes much larger. In the
population, the comparison is $\mathrm{Var}(k_i)$ versus $\mathrm{Var}[k_i - L(k_i \mid \mathbf{d}_i')]$, where $L(k_i \mid \mathbf{d}_i')$ is
the linear projection. Netting out $L(k_i \mid \mathbf{d}_i')$ produces a smaller variance unless $k_i$ and
$\mathbf{d}_i$ are uncorrelated.
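
The comparison between the adjusted standard error (21.45) and the naive one (21.47) can be illustrated for the logit case. This is a sketch under stated assumptions: the data are simulated with a correctly specified logit propensity score, and `fit_logit` is a hypothetical helper that runs a fixed number of Newton iterations rather than a production-quality optimizer.

```python
import numpy as np


def fit_logit(H, w, iters=25):
    """Bernoulli MLE for a logit model by Newton's method; H includes a column of ones."""
    g = np.zeros(H.shape[1])
    for _ in range(iters):
        p = 1.0 / (1.0 + np.exp(-H @ g))
        grad = H.T @ (w - p)                       # score of the log likelihood
        hess = (H * (p * (1 - p))[:, None]).T @ H  # minus the Hessian
        g = g + np.linalg.solve(hess, grad)
    return g


# Simulated data (an assumption for illustration): true treatment effect of 2.
rng = np.random.default_rng(1)
n = 1000
x = rng.normal(size=n)
H = np.column_stack([np.ones(n), x])
p_true = 1.0 / (1.0 + np.exp(-(0.2 + 0.5 * x)))
w = (rng.uniform(size=n) < p_true).astype(float)
y = 1.0 + x + 2.0 * w + rng.normal(size=n)

g_hat = fit_logit(H, w)
p_hat = 1.0 / (1.0 + np.exp(-H @ g_hat))

k_hat = (w - p_hat) * y / (p_hat * (1 - p_hat))  # summands, equation (21.43)
tau_hat = k_hat.mean()                           # equation (21.39)
d_hat = H * (w - p_hat)[:, None]                 # rows are d_i', equation (21.46)

# Regression (21.44): k_hat on 1, d_hat'; keep the residuals.
R = np.column_stack([np.ones(n), d_hat])
e_hat = k_hat - R @ np.linalg.lstsq(R, k_hat, rcond=None)[0]

se_adjusted = float(np.sqrt(np.mean(e_hat**2)) / np.sqrt(n))             # (21.45)
se_naive = float(np.sqrt(np.mean((k_hat - tau_hat) ** 2)) / np.sqrt(n))  # (21.47)
```

In any sample the adjusted standard error cannot exceed the naive one, because the residual sum of squares from a regression that includes a constant is never larger than the total sum of squares around the mean.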

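We can also find an appropriate standard error for $\hat{\tau}_{att,psw}$. Write

$$\hat{\tau}_{att,psw} = \hat{\rho}^{-1} N^{-1} \sum_{i=1}^{N} \hat{q}_i,$$

where $\hat{q}_i = [w_i - p(\mathbf{x}_i, \hat{\boldsymbol{\gamma}})] y_i / [1 - p(\mathbf{x}_i, \hat{\boldsymbol{\gamma}})]$. Now

$$\begin{aligned}
\sqrt{N}(\hat{\tau}_{att,psw} - \tau_{att}) &= \hat{\rho}^{-1} N^{-1/2} \sum_{i=1}^{N} (\hat{q}_i - \hat{\rho}\tau_{att}) = \hat{\rho}^{-1}\left[N^{-1/2} \sum_{i=1}^{N} \hat{q}_i - \tau_{att} N^{-1/2} \sum_{i=1}^{N} w_i\right] \\
&= \rho^{-1}\left[N^{-1/2} \sum_{i=1}^{N} r_i - \tau_{att} N^{-1/2} \sum_{i=1}^{N} w_i\right] + o_p(1),
\end{aligned}$$

where the $r_i$ are the population residuals from regressing $q_i$ on $\mathbf{d}_i'$ (using the same
argument as for $\hat{\tau}_{ate,psw}$). Therefore,

$$\sqrt{N}(\hat{\tau}_{att,psw} - \tau_{att}) = \rho^{-1} N^{-1/2} \sum_{i=1}^{N} (r_i - \tau_{att} w_i) + o_p(1),$$

and so estimation of the asymptotic variance is straightforward. The asymptotic
standard error of $\hat{\tau}_{att,psw}$ is

$$\hat{\rho}^{-1}\left[N^{-1} \sum_{i=1}^{N} (\hat{r}_i - \hat{\tau}_{att,psw} w_i)^2\right]^{1/2} \Big/ \sqrt{N}, \tag{21.48}$$

where $\hat{r}_i$ are the residuals from the regression $\hat{q}_i$ on $\hat{\mathbf{d}}_i'$.

A different use of the estimated propensity scores is in regression adjustment. A
simple, still somewhat popular estimate is obtained from the OLS regression

$$y_i \text{ on } 1,\ w_i,\ \hat{p}(\mathbf{x}_i), \quad i = 1, \ldots, N; \tag{21.49}$$

the coefficient on $w_i$, say, $\hat{\tau}_{ate,psreg}$, is the estimate of $\tau_{ate}$. The idea is that the estimated
propensity score should be sufficient in controlling for correlation between the treatment,
$w_i$, and the covariates, $\mathbf{x}_i$. As it turns out, there is a simple characterization
of when $\hat{\tau}_{ate,psreg}$ consistently estimates $\tau_{ate}$. The following is a special case of Wooldridge
(1999c, Proposition 3.2).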


PROPOSITION 21.3: In addition to Assumptions ATE.1′ and ATE.2, assume that
$\tau_{ate}(\mathbf{x}) = E(y_1 - y_0 \mid \mathbf{x})$ is uncorrelated with $\mathrm{Var}(w \mid \mathbf{x}) = p(\mathbf{x})[1 - p(\mathbf{x})]$. Then, under
standard regularity conditions that ensure that $\hat{\boldsymbol{\gamma}}$ (in a parametric estimation problem)
is consistent and $\sqrt{N}$-asymptotically normal, $\hat{\tau}_{ate,psreg}$ is consistent for $\tau_{ate}$ and
$\sqrt{N}$-asymptotically normal.


The assumption that $\tau_{ate}(\mathbf{x}) = \mu_1(\mathbf{x}) - \mu_0(\mathbf{x})$ is uncorrelated with $\mathrm{Var}(w \mid \mathbf{x})$ may
appear unlikely, as both are functions of $\mathbf{x}$. However, remember that correlation
measures linear dependence. It would not be surprising for $\tau_{ate}(\mathbf{x})$ to be monotonic in
many elements of $\mathbf{x}$, while $p(\mathbf{x})[1 - p(\mathbf{x})]$ is a quadratic in the propensity score. If
so, the correlation between these two functions of $\mathbf{x}$ might be small. (This comment is
analogous to the fact that if $z$ has a symmetric distribution around zero, $z$ and $z^2$ are
uncorrelated even though $z^2$ is an exact function of $z$.) In the case where the treatment
effect is constant, so that $\tau_{ate}(\mathbf{x}) = \tau_{ate}$, zero correlation holds, and $\hat{\tau}_{ate,psreg}$
consistently estimates $\tau_{ate}$ without further assumptions.
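
The parenthetical claim about $z$ and $z^2$ can be verified in one line. The calculation assumes, in addition to symmetry around zero, that $z$ has a finite third moment; that moment condition is an assumption added here so that every term is well defined.

```latex
\mathrm{Cov}(z, z^2) = E(z^3) - E(z)\,E(z^2) = 0 - 0 \cdot E(z^2) = 0,
```

because symmetry around zero makes every odd moment of $z$ equal to zero. The same logic suggests why a monotonic function of $\mathbf{x}$ can have little correlation with a function that is quadratic in the propensity score.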

The result in Section 13.10.2 can be used to obtain a proper standard error for 
$\hat{\tau}_{ate,psreg}$, whether or not different slopes are allowed. But we also know that ignoring
the propensity-score estimation results in asymptotically conservative inference.
Because regression on the propensity score is computationally simple, bootstrapping 
the propensity score and regression estimation is attractive as a way to obtain proper 
standard errors and confidence intervals. By resampling the entire vector (y;, wi, Xi), 
even the estimation error in Å, is easily accounted for. 

Rosenbaum and Rubin (1983) take a different approach to deriving regression on 
the propensity score and matching methods discussed in the next subsection. A key 
result is that under Assumption ATE.1, treatment is ignorable conditional only on 
the propensity score, p(x). For completeness, we provide a simple proof. 


PROPOSITION 21.4: Under Assumption ATE.1, $w$ is independent of $(y_0, y_1)$ conditional
on $p(\mathbf{x})$. Therefore, $E[y \mid w = 0, p(\mathbf{x})] = E[y_0 \mid p(\mathbf{x})]$ and $E[y \mid w = 1, p(\mathbf{x})] =
E[y_1 \mid p(\mathbf{x})]$.

Proof: Because $w$ is binary, it suffices to show that $P[w = 1 \mid y_0, y_1, p(\mathbf{x})] =
P[w = 1 \mid p(\mathbf{x})]$ or $E[w \mid y_0, y_1, p(\mathbf{x})] = E[w \mid p(\mathbf{x})]$. But, under Assumption ATE.1,
$E[w \mid y_0, y_1, \mathbf{x}] = E[w \mid \mathbf{x}] = p(\mathbf{x})$, where the second equality follows because $w$ is
binary. By iterated expectations,

$$E[w \mid y_0, y_1, p(\mathbf{x})] = E[E(w \mid y_0, y_1, \mathbf{x}) \mid y_0, y_1, p(\mathbf{x})] = E[p(\mathbf{x}) \mid y_0, y_1, p(\mathbf{x})] = p(\mathbf{x}),$$

which shows that ignorability of treatment holds conditional on $p(\mathbf{x})$.

Next, write $y = (1 - w)y_0 + wy_1$, as usual. Then

$$\begin{aligned}
E[y \mid w, p(\mathbf{x})] &= (1 - w)E[y_0 \mid w, p(\mathbf{x})] + wE[y_1 \mid w, p(\mathbf{x})] \\
&= (1 - w)E[y_0 \mid p(\mathbf{x})] + wE[y_1 \mid p(\mathbf{x})].
\end{aligned}$$

Inserting $w = 0$ and $w = 1$, respectively, gives the results for the conditional means.


Proposition 21.4 has many uses. For one, it implies that, since $E[y_g \mid p(\mathbf{x})]$ is identified,
$g = 0, 1$, we can identify $\tau_{ate} = E\{E[y_1 \mid p(\mathbf{x})] - E[y_0 \mid p(\mathbf{x})]\}$. In particular, let

$$r_0(p) = E[y \mid w = 0, p(\mathbf{x}) = p], \qquad r_1(p) = E[y \mid w = 1, p(\mathbf{x}) = p] \tag{21.50}$$

be the two regression functions that can be identified given the data. Given consistent
estimators $\hat{r}_0(\hat{p}_i)$, $\hat{r}_1(\hat{p}_i)$, where $\hat{p}_i = \hat{p}(\mathbf{x}_i)$ are the estimated propensity scores, we get
as a general estimate

$$\hat{\tau}_{ate,psreg} = N^{-1} \sum_{i=1}^{N} [\hat{r}_1(\hat{p}_i) - \hat{r}_0(\hat{p}_i)]. \tag{21.51}$$

Generally, this estimator is consistent and $\sqrt{N}$-asymptotically normal provided we
have the models for $r_g(\cdot)$, $g = 0, 1$ and the propensity score correctly specified (or we
use suitable nonparametric methods). Given the nature of $\hat{\tau}_{ate,psreg}$, bootstrapping is
an attractive method for inference.

Heckman, Ichimura, and Todd (1998) propose local smoothers to estimate $r_0(\cdot)$
and $r_1(\cdot)$; this proposal is attractive because these are functions of a scalar. But one
must also estimate the propensity score, and this is often a high-dimensional problem.
Hahn (1998) proposed series estimation of $r_0(\cdot)$ and $r_1(\cdot)$ along with series estimation
of $p(\cdot)$. A simpler approach is just to use parametric functions and parametric
asymptotics. 

If $r_0(\cdot)$ and $r_1(\cdot)$ are linear, say, $r_g(p) = \alpha_g + \gamma_g p$, $g = 0, 1$, then estimation is
straightforward. Two separate regressions can be run, and then the differences in
predicted values, $\hat{r}_1(\hat{p}_i) - \hat{r}_0(\hat{p}_i)$, are averaged to get $\hat{\tau}_{ate,psreg}$. Equivalently, run the
regression

$$y_i \text{ on } 1,\ w_i,\ \hat{p}(\mathbf{x}_i),\ w_i \cdot [\hat{p}(\mathbf{x}_i) - \hat{\rho}], \quad i = 1, \ldots, N, \tag{21.52}$$

where $\hat{\rho}$ is a consistent estimate of $\rho = E[p(\mathbf{x}_i)] = P(w_i = 1)$. If $\hat{p}(\mathbf{x}_i)$ is from a logit
that includes an intercept, the two natural estimates, $\bar{w}$ and $N^{-1} \sum_{i=1}^{N} \hat{p}(\mathbf{x}_i)$, are
identical. The coefficient on $w_i$ is $\hat{\tau}_{ate,psreg}$. If we ignore estimation of the propensity score, the usual
heteroskedasticity-robust standard error will be conservative, but the bootstrap can
be used to obtain the proper standard error.

The linear models for $E[y_g \mid p(\mathbf{x})]$ might be too restrictive because $0 < p(\mathbf{x}) < 1$. If
$y$ has substantial variation, different functions might be needed. One can always
use polynomials in $\hat{p}_i$, as formally studied in a series setting by Hahn (1998). A
log-odds transformation might improve the fit, where $lodds_i = \log[\hat{p}_i/(1 - \hat{p}_i)]$ and
functions of it (such as polynomials) are used as regressors. It is easy to use two separate
regressions and then average the differences in predicted values.

Given that $r_1(\cdot)$ is estimated using the treated subsample and $r_0(\cdot)$ using the control
sample, we can see the nature of the overlap assumption. In effect, for each value
of the propensity score in the interval (0,1), we should observe both treated and
untreated units. We can easily use the estimated propensity scores to determine
whether lack of overlap is a problem. For example, we can compute the normalized
difference of the propensity scores (or its log-odds ratio), as in equation (21.29), using
the propensity score in place of the covariates. It is also informative to plot histograms
of the $\hat{p}_i$ for the control and treatment groups. We hope to rule out situations
where the histograms show bins with data on the control units but no or very few 
data for the corresponding treatment units, and vice versa. 
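
A numerical stand-in for the histogram comparison is to count treated and control units within bins of the estimated propensity score. The sketch below is hypothetical throughout: the function name, the ten equal-width bins, and the small data set are illustrative choices, and the 0.10 to 0.90 trimming window is just one rule of thumb of the kind mentioned in this chapter.

```python
def overlap_report(p_hat, w, n_bins=10, lo=0.10, hi=0.90):
    """Count controls and treated units per propensity-score bin, list the bins
    that contain only one of the two groups, and flag units inside [lo, hi]."""
    counts = [[0, 0] for _ in range(n_bins)]  # [controls, treated] in each bin
    for p, wi in zip(p_hat, w):
        b = min(int(p * n_bins), n_bins - 1)  # a score of exactly 1.0 goes in the last bin
        counts[b][wi] += 1
    one_sided = [b for b, (c, t) in enumerate(counts) if (c == 0) != (t == 0)]
    keep = [lo <= p <= hi for p in p_hat]
    return counts, one_sided, keep


# Hypothetical estimated propensity scores and treatment indicators.
p_hat = [0.02, 0.05, 0.15, 0.45, 0.55, 0.65, 0.85, 0.95]
w = [0, 0, 0, 1, 0, 1, 1, 1]

counts, one_sided, keep = overlap_report(p_hat, w)
```

Bins listed in `one_sided` contain data for only one group, which is exactly the situation the histograms are meant to reveal. With so few observations every occupied bin is one-sided here; in real data one would look for a run of such bins at either end of the unit interval.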

Regression on the propensity score seems attractive because, compared with 
regression adjustment using x, it reduces the problem of controlling for a possibly 
large set of covariates, x, to controlling for a single function of them, p(x). But 
the parsimony afforded by propensity score regressions is somewhat illusory because 
we should use a flexible model for the propensity score. Even if we settle on logit or 
probit, we would typically include flexible functions of x, just as if we were doing 
regression adjustment. One might argue that, because the treatment w; is binary, we 
have better leads on suitable models for P(w = 1 | x). Nevertheless, as we discussed in 
Section 21.3.2, we can account for the nature of y when applying regression adjust- 
ment. If we use nonparametric approaches, estimating either p(x) or the regression 
functions $m_g(\mathbf{x}) = E(y \mid w = g, \mathbf{x})$ generally requires high-dimensional nonparametric
estimation. (Incidentally, if we use a linear probability model for p(x), regressing y; 
either on the covariates themselves or the propensity score gives the same estimate 
of Tate.) 

It is easy to see why regression on the propensity score is generally inefficient. 
Suppose that assignment is random so that w is independent of (yo, y1, x), but x helps 
to predict the counterfactual outcomes. Further, assume that we actually know the 
propensity score, so that regression on the propensity score is $y_i$ on 1, $w_i$, $p(\mathbf{x}_i)$.
Because of random assignment, we do not need to include $p(\mathbf{x}_i)$ in order to obtain a
consistent estimator of $\tau_{ate}$: the simple difference in sample averages is consistent.
Adding $p(\mathbf{x}_i)$ can actually improve efficiency over the difference-in-means estimator
if $p(\mathbf{x})$ appears in the linear projections $L[y_g \mid 1, p(\mathbf{x})]$: adding $p(\mathbf{x})$ will shrink
the error variance. But $p(\mathbf{x})$ is hardly the best function of $\mathbf{x}$ to add to the regression.
In the case of a constant treatment effect, the best function of $\mathbf{x}$ to add is
$\mu_0(\mathbf{x}) = E(y_0 \mid \mathbf{x})$ because this approach leads to the smallest possible mean squared
error among functions of x. In effect, the error variance is made as small as possible. 
Of course, we do not know this mean function, but we can approximate it much 
better using flexible functions of x than by using a linear function of just the pro- 
pensity score. 

We now show how regression methods and inverse probability weighted (IPW) 
methods can be used to estimate treatment effects of job training on labor earnings. 
The underlying data are from Lalonde (1986), although the particular data set used 
here is from Dehejia and Wahba (1999).

Example 21.1 (Causal Effects of Job Training on Earnings): We will use two data
sets for this example, JTRAIN2.RAW and JTRAIN3.RAW. The first data set is 
from a job training experiment back in the 1970s for men with poor labor market 
histories. Being in the treatment group, indicated by train=1, was randomly 
assigned. Out of 445 men, 185 were assigned to the job training program. The 
response variable is re78, real labor market earnings in 1978. (This variable is zero for 
a nontrivial fraction of the population.) Training began up to two years prior to 
1978. The controls that we use here are age (in years), years of education, dummy 
variables for being black, being Hispanic, and marital status, and real earnings for 
the two years prior to the start of the program, 1974 and 1975. For simplicity, these 
all appear in level form without interactions or other functional forms. 

JTRAIN3.RAW contains the same information as JTRAIN2.RAW, but the control
group in JTRAIN3.RAW was obtained as a random sample from the Panel
Study of Income Dynamics. Thus, JTRAIN3.RAW is essentially a nonexperimental 
version of JTRAIN2.RAW, which allows us to determine how well the non- 
experimental methods for estimating treatment effects compare with experimental 
estimates (Lalonde’s original motivation). 

We use four estimation methods: a simple comparison of means, regression
adjustment pooling across the control and treated groups, regression adjustment
using separate regression functions, and IPW estimation using the propensity score.
In each case, the controls are (age, educ, black, hisp, married, re74, re75). Note that
for the difference-in-means and pooled regression adjustment there is no difference
between $\hat{\tau}_{ate}$ and $\hat{\tau}_{att}$.

Before presenting the estimates, we note that there is a severe problem with overlap
in JTRAIN3.RAW. For example, for the variable re75, the absolute value of the
normalized difference in equation (21.29) is about 1.25, well above the Imbens and
Rubin rule of 0.25. Other covariates have normalized differences above unity, too.
Therefore, unless we think the ATE is constant, we can expect problems in trying to
estimate $\tau_{ate}$ and, to a lesser degree, $\tau_{att}$, even under ignorability.

Table 21.1 contains the estimates and the standard errors. For the first two esti- 
mation methods, we use the heteroskedasticity-robust standard error from the OLS 
regressions; by assumption, $\tau_{att} = \tau_{ate}$. For regression adjustment using separate
regression functions, as well as propensity score weighting, we use 1,000 bootstrap 
replications in Stata 10. It is slightly more difficult to use the analytical formulas 
derived in equation (21.31) and expression (21.47). 

Table 21.1

                                  JTRAIN2                                 JTRAIN3 (Full Sample)                   JTRAIN3 (Reduced Sample)
Estimation Method                 $\hat{\tau}_{ate}$  $\hat{\tau}_{att}$  $\hat{\tau}_{ate}$  $\hat{\tau}_{att}$  $\hat{\tau}_{ate}$  $\hat{\tau}_{att}$
Difference in means               1.794               1.794               -15.205             -15.205             -5.005              -5.005
                                  (0.671)             (0.671)             (0.656)             (0.656)             (0.657)             (0.657)
Pooled regression adjustment      1.683               1.683               0.860               0.860               2.059               2.059
                                  (0.658)             (0.658)             (0.767)             (0.767)             (0.801)             (0.801)
Separate regression adjustment    1.633               1.774               -8.910              931                 -2.340              2.323
                                  (0.642)             (0.661)             (3.721)             (0.794)             (1.480)             (0.817)
Propensity score weighting        1.627               1.798               11.029              1.627               7.049               1.649
                                  (0.637)             (0.660)             (40.809)            (0.835)             (17.295)            (0.986)
Sample size                       445                 445                 2,675               2,675               1,162               1,162

Not surprisingly, using the experimental data the estimates are consistent across
estimation method. Because of random assignment, $\tau_{att} = \tau_{ate}$, so it is also expected
that the estimates of these parameters would be similar. The job training program
is estimated to increase real labor market earnings in 1978 by between $1,627 and 
$1,798. This is a huge effect considering that the average value of re78 for the 
untreated sample is $4,554. 

The pattern of estimates is very different when the nonexperimental data are used. 
It is not too surprising that the simple comparison-of-means estimate, which provides 
no control for self-selection into training, is negative and large. But the other 
estimates of $\tau_{ate}$ also appear unreliable. The estimate of $\tau_{ate}$ using propensity score
weighting is suspiciously large (though very imprecise). In fact, of the 2,675 observations,
the logit model for the propensity score perfectly predicts 158 of the train = 0
outcomes. In effect, we are dividing by zero (but in practice it is a very small number).
Before one uses propensity score weighting, one should study the distribution of
the propensity score. As discussed earlier, one might reduce the sample to observations
with, say, $\hat{p}(\mathbf{x}_i)$ between 0.10 and 0.90, or 0.05 and 0.95.

Controlling for the covariates, either through regression adjustment or propensity
score weighting, provides sensible estimates of $\tau_{att}$, but the estimates are imprecise.
The propensity score weighting is sensible for $\tau_{att}$ because small probabilities do not
affect $\hat{\tau}_{att}$ (see equation (21.40)). The largest estimated propensity score in the entire
sample is about 0.936. It is clear by looking at simple summary statistics of the pro- 
pensity score that lack of balance is a serious problem: in the treated subsample, the 
average propensity score is about 0.631; in the control subsample, it is about 0.027. 

To address the lack of overlap somewhat crudely, we also use a subset of the data 
in JTRAIN3.RAW. We use only observations where the average of re74 and re75 
is less than or equal to $15,000. This choice is essentially arbitrary but provides 
a mechanism for dropping men who likely have little chance of being part of such a 
program. The estimates are given in the last two columns of Table 21.1. Restricting
the sample produces more sensible estimates, but those for $\tau_{ate}$ are still unstable.
Again, the three estimates of $\tau_{att}$ that account for covariates are fairly stable but
not especially precise. A more careful analysis of these data would use more flexible 
functional forms and cause one to think more carefully about how one might restrict 
the sample. 


21.3.4 Combining Regression Adjustment and Propensity Score Weighting 


In the previous two subsections, we described methods for estimating ATEs based 
on two strategies: the first is based on estimating $\mu_g(\mathbf{x}) = E(y_g \mid \mathbf{x})$ for $g = 0, 1$ and
averaging the differences in fitted values, as in equation (21.27), and the second is 
based on propensity score weighting, as in equation (21.39). For each approach we 
have discussed estimators that achieve the asymptotic efficiency bound. If we have 
large sample sizes relative to the dimension of x;, we might think nonparametric 
estimators of the conditional means or propensity score are sufficiently accurate to 
invoke the asymptotic efficiency results. 

In other cases we might choose flexible parametric models because full non- 
parametric estimation is difficult. As shown earlier, one reason for viewing estimators 
of conditional means or propensity scores as flexible parametric models is that it 
simplifies standard error calculations for treatment effect estimates. But if we use 
standard error calculations that rely on parametric models, we should admit the 
possibility that those parametric models are misspecified. As it turns out, we can 
combine regression adjustment and propensity score methods to achieve some 
robustness to misspecification of the parametric models. The resulting estimator of 
$\tau_{ate}$ is said to be doubly robust because it only requires either the conditional mean
model or the propensity score model to be correctly specified, not both. 

The idea behind the doubly robust estimators is developed in Robins and Rot- 
nitzky (1995), Robins, Rotnitzky, and Zhao (1995), and van der Laan and Robins 
(2003). Wooldridge (2007) provides a simple proof that double robustness holds for 
certain combinations of conditional mean specifications and estimation methods. 
To describe the approach, let $m_0(\cdot, \boldsymbol{\delta}_0)$ and $m_1(\cdot, \boldsymbol{\delta}_1)$ be parametric functions for
$E(y_g \mid \mathbf{x})$, $g = 0, 1$, and let $p(\cdot, \boldsymbol{\gamma})$ be a parametric model for the propensity score. In
the first step we estimate $\boldsymbol{\gamma}$ by Bernoulli maximum likelihood and obtain the estimated
propensity scores as $p(\mathbf{x}_i, \hat{\boldsymbol{\gamma}})$ (probably logit or probit). In the second step, we
use regression or a quasi-likelihood method, where we weight by the inverse probability.
For example, if we use linear functional forms for the conditional mean, to
estimate $\boldsymbol{\delta}_1 = (\alpha_1, \boldsymbol{\beta}_1')'$ we would use the IPW linear least squares problem

$$\min_{\alpha_1, \boldsymbol{\beta}_1} \sum_{i=1}^{N} w_i(y_i - \alpha_1 - \mathbf{x}_i\boldsymbol{\beta}_1)^2 / p(\mathbf{x}_i, \hat{\boldsymbol{\gamma}}); \tag{21.53}$$

for $\boldsymbol{\delta}_0$, we weight by $1/[1 - \hat{p}(\mathbf{x}_i)]$ and use the $w_i = 0$ sample. Then, we estimate $\tau_{ate}$
as the average of the difference in predicted values,

$$\hat{\tau}_{ate,pswreg} = N^{-1} \sum_{i=1}^{N} [(\hat{\alpha}_1 + \mathbf{x}_i\hat{\boldsymbol{\beta}}_1) - (\hat{\alpha}_0 + \mathbf{x}_i\hat{\boldsymbol{\beta}}_0)]. \tag{21.54}$$

This is the same formula as linear regression adjustment, but we are using different
estimates of $\alpha_g$, $\boldsymbol{\beta}_g$, $g = 0, 1$.
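
The two-step procedure in equations (21.53) and (21.54) can be sketched as follows. The data are simulated, both the logit propensity score and the linear conditional means happen to be correctly specified in this design, and the helper names (`fit_logit`, `wls`) are hypothetical; the code is meant to show the mechanics, not to demonstrate the robustness property itself.

```python
import numpy as np


def fit_logit(H, w, iters=25):
    """Bernoulli MLE for a logit model by Newton's method; H includes a column of ones."""
    g = np.zeros(H.shape[1])
    for _ in range(iters):
        p = 1.0 / (1.0 + np.exp(-H @ g))
        hess = (H * (p * (1 - p))[:, None]).T @ H
        g = g + np.linalg.solve(hess, H.T @ (w - p))
    return g


def wls(X, y, weights):
    """Weighted least squares: minimize sum_i weights_i * (y_i - X_i b)^2."""
    sw = np.sqrt(weights)
    return np.linalg.lstsq(X * sw[:, None], y * sw, rcond=None)[0]


# Simulated data (an assumption for illustration): true treatment effect of 2.
rng = np.random.default_rng(7)
n = 2000
x = rng.normal(size=n)
X = np.column_stack([np.ones(n), x])
p_true = 1.0 / (1.0 + np.exp(-0.3 * x))
w = (rng.uniform(size=n) < p_true).astype(float)
y = 1.0 + x + 2.0 * w + rng.normal(size=n)

# Step 1: estimated propensity scores from a logit.
p_hat = 1.0 / (1.0 + np.exp(-X @ fit_logit(X, w)))

# Step 2: inverse-probability-weighted least squares on each subsample.
t, c = (w == 1), (w == 0)
d1 = wls(X[t], y[t], 1.0 / p_hat[t])          # equation (21.53)
d0 = wls(X[c], y[c], 1.0 / (1.0 - p_hat[c]))  # control-sample analogue

# Average the difference in predicted values over the full sample, (21.54).
tau_dr = float(np.mean(X @ d1 - X @ d0))
```

The weights enter only the estimation of the regression coefficients; the final averaging step is unweighted and uses every observation.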

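To carefully describe the double robustness result, denote the probability limits of
$\hat{\boldsymbol{\gamma}}$, $\hat{\boldsymbol{\delta}}_0$, and $\hat{\boldsymbol{\delta}}_1$ by $\boldsymbol{\gamma}^*$, $\boldsymbol{\delta}_0^*$, and $\boldsymbol{\delta}_1^*$, respectively. Now, if the conditional mean functions
are truly linear and ignorability holds conditional on $\mathbf{x}_i$, then weighted least squares
estimators using any function of $\mathbf{x}_i$ consistently estimate $\boldsymbol{\delta}_0^*$ on the control sample and
$\boldsymbol{\delta}_1^*$ on the treated sample. We covered the general case for missing data in Section
19.8 (see also Wooldridge, 2007). Therefore, even if $p(\mathbf{x}, \boldsymbol{\gamma})$ is arbitrarily misspecified,
weighting by functions of $p(\mathbf{x}_i, \boldsymbol{\gamma}^*)$ does not cause inconsistency for estimating the
parameters of the correctly specified conditional mean. (Replacing $\boldsymbol{\gamma}^*$ with $\hat{\boldsymbol{\gamma}}$ does
not change this claim because $\hat{\boldsymbol{\gamma}}$ converges in probability to $\boldsymbol{\gamma}^*$.) Now if $E(y_g \mid \mathbf{x}) =
\alpha_g^* + \mathbf{x}\boldsymbol{\beta}_g^*$, $g = 0, 1$, then $\tau_{ate} = E[(\alpha_1^* + \mathbf{x}\boldsymbol{\beta}_1^*) - (\alpha_0^* + \mathbf{x}\boldsymbol{\beta}_0^*)]$, which is the usual identification
result for $\tau_{ate}$ for regression adjustment.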

The other half of the double robustness result is more subtle. Now suppose that
$p(x,\gamma)$ is correctly specified for $P(w = 1|x)$ but allow for the possibility that the
conditional means are not linear. As we discussed in Section 19.8, the IPW estimator
(under ignorability) recovers the solution to the unweighted minimization
problem in the population. In the linear regression case, that means $\delta_g^*$ minimizes
$E[(y_g - \alpha_g - x\beta_g)^2]$ for $g = 0,1$. In other words, the $\delta_g^*$ are the parameters in the
linear projection $L(y_g|1,x)$; because we include a constant in this projection, the
unconditional mean of $y_g$ is the mean of the linear projection, $E(y_g) = E(\alpha_g^* + x\beta_g^*)$.
Therefore, $\tau_{ate} = E[(\alpha_1^* + x\beta_1^*) - (\alpha_0^* + x\beta_0^*)]$, just as before, except now we do not
need the linear functions to be correctly specified for the conditional means. In other
words, linear regression adjustment can still produce a consistent estimator of $\tau_{ate}$
when the conditional means are misspecified, but we must use IPW estimation with
a correctly specified model for the propensity score. The argument is essentially
unchanged when we replace $x$ with any functions $h(x)$.

As illustrated for the linear regression case, the key to the second part of the double
robustness property (that is, when the conditional means are misspecified) is that
we can still recover the counterfactual unconditional means, $E(y_g)$, as $E(y_g) =
E[m_g(x,\delta_g^*)]$. Thus, to extend the double robustness result to nonlinear conditional
mean models, we need to find combinations of conditional mean functions and
objective functions with this property, where, again, the $\delta_g^*$ solve the population
optimization problem. This is a special property, but a couple of very useful cases
are known. They turn out to be quasi-maximum likelihood estimators in particular
linear exponential family distributions with particular conditional mean functions.
We already saw the linear case with the least squares (normal log likelihood) objective
function. If $y_g$ is a binary or fractional response, the mean function that
delivers double robustness, along with the Bernoulli quasi-log likelihood, is the
logistic function


$$m(x,\delta_g) = \Lambda[\alpha_g + h(x)\beta_g], \tag{21.55}$$


where h(x) can be any function of x. In other words, for the treated group, we solve 
the IPW QMLE problem 


$$\max_{\alpha_1,\beta_1} \sum_{i=1}^{N} w_i \{(1 - y_i)\log[1 - \Lambda(\alpha_1 + h_i\beta_1)] + y_i \log[\Lambda(\alpha_1 + h_i\beta_1)]\} / p(x_i,\hat\gamma),$$


where h; = h(x;). Once we have used IPW estimation in each case, the ATE is esti- 
mated as before: 


$$\hat\tau_{ate,pswreg} = N^{-1} \sum_{i=1}^{N} [\Lambda(\hat\alpha_1 + h_i\hat\beta_1) - \Lambda(\hat\alpha_0 + h_i\hat\beta_0)].$$


If $E(y_g|x) = \Lambda[\alpha_g^* + h(x)\beta_g^*]$, $g = 0,1$, or $P(w = 1|x) = p(x,\gamma^*)$, then $\hat\tau_{ate,pswreg} \overset{p}{\to}
\tau_{ate}$. We might very well use a logit model for $P(w = 1|x)$, but that is not necessary.
Also, notice that if we replace $\Lambda(\cdot)$ with, say, the standard normal cdf, $\Phi(\cdot)$, in
equation (21.55), we lose the double robustness property.

If $y_g$ is a nonnegative response (it could be continuous, discrete, or have both
features), an exponential mean function coupled with the Poisson QLL delivers double
robustness. With $h_i = h(x_i)$ functions of $x_i$, we solve an IPW Poisson estimation,

$$\max_{\alpha_1,\beta_1} \sum_{i=1}^{N} w_i [y_i(\alpha_1 + h_i\beta_1) - \exp(\alpha_1 + h_i\beta_1)] / p(x_i,\hat\gamma),$$

for the treated sample, and similarly for the control sample. The average treatment
effect then has the same form as equation (21.37).

Why do the Bernoulli and Poisson QMLEs with, respectively, the logistic and
exponential mean functions yield doubly robust estimators of $\tau_{ate}$? Again, the first
half of double robustness is straightforward. The Bernoulli and Poisson QMLEs are
fully robust for estimating the parameters of a correctly specified mean regardless of
the nature of $y_g$ (and with any mean function); weighting by a strictly positive function
of $x_i$, either $1/p(x_i,\gamma^*)$ or $1/[1 - p(x_i,\gamma^*)]$, does not change the robustness
property when assignment is ignorable. This reasoning gives the first half of the double
robustness. For the second half, we need the specific conditional mean functions for
the corresponding QLL. When $p(x,\gamma)$ is correctly specified, so that $P(w = 1|x) =
p(x,\gamma^*)$, the IPW estimator consistently estimates the solution to the unweighted
population problem, as always. The key is that for the two combinations of mean/
QLLs just described, the probability limits satisfy $E(y_g) = E[m_g(x,\delta_g^*)]$. These are
easy to establish by studying the first-order conditions for $\delta_g^*$. For example, in the
Poisson case, $\delta_g^*$ satisfies (for a random draw $i$)


$$E[(1,h_i)' y_{ig}] = E[(1,h_i)' \exp(\alpha_g^* + h_i\beta_g^*)],$$


and the first of these equations is simply $E(y_{ig}) = E[\exp(\alpha_g^* + h_i\beta_g^*)]$. A similar
calculation establishes $E(y_g) = E\{\Lambda[\alpha_g^* + h(x)\beta_g^*]\}$ in the logistic/Bernoulli case. (The
sample analogue of these population conditions also holds: the sample average of $y_i$
is equal to the average of the fitted values.)
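The sample analogue mentioned in the parenthetical is easy to verify numerically. The sketch below fits a Poisson QMLE with an exponential mean by Newton's method on simulated data; the data-generating process, the function name `poisson_qmle`, and the arbitrary positive weights are assumptions made for illustration. The response is continuous and its true conditional mean is not exponential, so the mean model is misspecified on purpose.

```python
import numpy as np

def poisson_qmle(H, y, weights=None, iters=100):
    """Maximize sum_i weights_i * [y_i * (H_i d) - exp(H_i d)] by Newton's method.

    H must contain a column of ones so that the first-order condition for the
    intercept forces the (weighted) mean of y to equal that of the fitted values.
    """
    if weights is None:
        weights = np.ones(len(y))
    d = np.zeros(H.shape[1])
    d[0] = np.log(np.average(y, weights=weights))  # start from constant-only fit
    for _ in range(iters):
        mu = np.exp(H @ d)
        score = H.T @ (weights * (y - mu))
        hessian = (H * (weights * mu)[:, None]).T @ H
        d = d + np.linalg.solve(hessian, score)
    return d

rng = np.random.default_rng(3)
N = 5_000
h = rng.uniform(-1.0, 1.0, size=N)
# Continuous, nonnegative y with mean 2*(1 + h^2): not of the form exp(a + b*h).
y = rng.gamma(2.0, 1.0 + h**2)
H = np.column_stack([np.ones(N), h])

# Unweighted fit: sample average of y equals average of fitted values.
fitted = np.exp(H @ poisson_qmle(H, y))

# Weighted fit (weights stand in for inverse probabilities): the same identity
# holds for weighted averages.
wts = rng.uniform(1.0, 5.0, size=N)
fitted_w = np.exp(H @ poisson_qmle(H, y, weights=wts))
```

The identity comes only from the intercept's first-order condition, which is why it survives misspecification of the mean and why it fails for non-canonical pairings such as a probit mean with the Bernoulli QLL.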

For the previous three cases discussed, the mean functions for the normal (squared
residual), Bernoulli, and Poisson quasi-log likelihoods correspond to the “canonical
link” functions in the language of generalized linear models. This correspondence
is not a coincidence. The mean function associated with a canonical link always
has the property $E(y_g) = E[m_g(x,\delta_g^*)]$ provided a constant is included in the index
function, as would always be the case in treatment effect applications. It is important
to remember that these three conditional mean/QLL combinations can be used
for a variety of response variables. As a practical matter, we should ensure that
the chosen mean function is logically consistent with the nature of $y_g$. If $y_g$ has
unbounded support and takes on positive and negative values, a linear model seems
natural. If $y_g$ is binary or fractional (with possible mass at zero or one), the logit
approach seems natural. And if $y_g \geq 0$ and unbounded, the exponential function
(combined with the Poisson QLL) seems natural. Remember, the Poisson QMLE can
be applied to any kind of nonnegative $y_g$.

Because the estimates of $\tau_{ate}$ take the form of equation (21.30), we can use equation
(21.31), with proper formulas for $V_0$ and $V_1$, to compute an asymptotic standard
error for $\hat\tau_{ate,pswreg}$. If the conditional means are correctly specified, the usual robust
variance matrix is valid for the parameters using the weighted quasi-MLEs. If the
conditional means are misspecified but the propensity score is correctly specified, a
better (and smaller) estimate of the asymptotic variances of the $\hat\delta_g$ is obtained by
netting out the gradient of the propensity score log likelihood from the weighted,
selected score for the QMLE, just as in Section 19.8; see also Wooldridge (2007). We
might just use the variance matrix that ignores estimation of $\gamma^*$ because, at worst, it
is conservative. However, there is no harm in adjusting for the MLE estimation of
$\gamma^*$ because, if the means are correctly specified, the adjustment will be minimal in
large samples. Of course, bootstrapping the two-step procedure and the formula for
$\hat\tau_{ate,pswreg}$ is easy and provides asymptotically correct inference.


21.3.5 Matching Methods 


The motivation for matching estimators is similar to regression adjustment. In particular,
for each $i$, we impute values for the counterfactuals, $y_{i0}$ and $y_{i1}$. Matching
estimators use the observed outcomes when possible. In other words, if we let $\hat y_{i0}$ and
$\hat y_{i1}$ denote the imputed values, $\hat y_{i0} = y_i$ when $w_i = 0$ and $\hat y_{i1} = y_i$ when $w_i = 1$.
Generally, matching estimators take the forms

$$\hat\tau_{ate,match} = N^{-1} \sum_{i=1}^{N} (\hat y_{i1} - \hat y_{i0}) \tag{21.56}$$

$$\hat\tau_{att,match} = N_1^{-1} \sum_{i=1}^{N} w_i (y_i - \hat y_{i0}), \tag{21.57}$$

where the latter formula uses the fact that $\hat y_{i1} = y_i$ for the treated subsample. (In
other words, we never need to impute $y_{i1}$ for the treated subsample.)

The key question in matching is how to impute $y_{i0}$ for the treated units (for both
$\tau_{ate}$ and $\tau_{att}$) and how to impute $y_{i1}$ for the control units (for $\tau_{ate}$). Abadie and Imbens
(2006) consider several approaches, each of which involves finding one or more
matches based on the covariate values. In the simplest case, a single match is found
for each observation. For concreteness, suppose $i$ is a treated observation ($w_i = 1$).
Then $\hat y_{i1} = y_i$ and $\hat y_{i0} = y_{h(i)}$ for $h(i)$ such that $w_{h(i)} = 0$ and unit $h(i)$ is “closest” to
unit $i$ in the sense that $x_{h(i)}$ is “closest” to $x_i$ based on a chosen metric (or distance).
In other words, for the treated unit $i$ we find the “most similar” untreated observation,
and use its response as our best estimate of $y_{i0}$. Similarly, if $w_i = 0$, $\hat y_{i0} = y_i$ and
$\hat y_{i1} = y_{h(i)}$, where now $w_{h(i)} = 1$. If we choose matches based on the full set of covariates,
we call this method matching on the covariates. If we settle on a single match
for each unit and we have the list of covariates, the only issue is in choosing the distance
measure. A common metric is the Mahalanobis distance, which for observations
$h$ and $i$ is (the square root of) $(x_h - x_i)\hat\Sigma_x^{-1}(x_h - x_i)'$, where $\hat\Sigma_x$ is the sample
$K \times K$ variance-covariance matrix of the covariates. Some packages use a diagonal
version as the default, which gives the weighted sum $\sum_{j=1}^{K} (x_{hj} - x_{ij})^2 / \hat\sigma_j^2$.


Rather than using a single “nearest neighbor,” we can impute the missing values
using an average of $M$ nearest neighbors. If $w_i = 1$ then

$$\hat y_{i0} = M^{-1} \sum_{h \in \aleph_M(i)} y_h, \tag{21.58}$$

where $\aleph_M(i)$ contains the $M$ untreated nearest matches to observation $i$, again based
on the covariates. In particular, for all $h \in \aleph_M(i)$, $w_h = 0$. (With ties in the distances,
there can be more than $M$ elements in $\aleph_M(i)$, and then $M$ is replaced with the number
of elements in $\aleph_M(i)$.) Similarly, if $w_i = 0$,

$$\hat y_{i1} = M^{-1} \sum_{h \in \aleph_M(i)} y_h, \tag{21.59}$$

where $\aleph_M(i)$ contains the $M$ treated nearest matches to observation $i$.
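The imputation rules (21.58) and (21.59), combined with the estimators (21.56) and (21.57), can be sketched as follows. This is an illustrative implementation, not the routine used by any particular package: the function names are hypothetical, the distance is the diagonal (inverse-variance) metric described above, and ties beyond the first M matches are broken arbitrarily rather than averaged over as the text describes.

```python
import numpy as np

def match_impute(y, w, x, M=1):
    """Impute y_i0 for treated units and y_i1 for controls using the M nearest
    neighbors from the opposite group, with distance sum_j (x_hj - x_ij)^2 / s_j^2."""
    y = np.asarray(y, dtype=float)
    w = np.asarray(w)
    x = np.asarray(x, dtype=float)
    if x.ndim == 1:
        x = x[:, None]
    xs = x / x.std(axis=0, ddof=1)          # scale each covariate by its sample s.d.
    y0_hat = np.where(w == 0, y, np.nan)    # observed outcomes are used when possible
    y1_hat = np.where(w == 1, y, np.nan)
    treated = np.flatnonzero(w == 1)
    controls = np.flatnonzero(w == 0)
    for i in range(len(y)):
        pool = controls if w[i] == 1 else treated
        dist = ((xs[pool] - xs[i]) ** 2).sum(axis=1)
        nearest = pool[np.argsort(dist, kind="stable")[:M]]
        if w[i] == 1:
            y0_hat[i] = y[nearest].mean()
        else:
            y1_hat[i] = y[nearest].mean()
    return y0_hat, y1_hat

def matching_effects(y, w, x, M=1):
    """Return (tau_ate, tau_att) in the form of (21.56) and (21.57)."""
    y = np.asarray(y, dtype=float)
    w = np.asarray(w)
    y0_hat, y1_hat = match_impute(y, w, x, M)
    tau_ate = np.mean(y1_hat - y0_hat)
    tau_att = np.sum(w * (y - y0_hat)) / np.sum(w)
    return tau_ate, tau_att

# Tiny made-up example: each treated unit has one obviously closest control.
x = np.array([0.0, 1.0, 2.0, 0.1, 1.1, 2.1])
w = np.array([1, 1, 1, 0, 0, 0])
y = np.array([3.0, 4.0, 5.0, 1.0, 2.0, 3.0])
tau_ate, tau_att = matching_effects(y, w, x, M=1)
```

In the toy data every matched pair differs by exactly 2 in the outcome, so both estimates equal 2. The loop is quadratic in the sample size, which is acceptable for a sketch but not for large data sets.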

Matching on the full set of covariates can be computationally intensive, but with
modern computers the burden is manageable. The method produces a consistent
estimator of $\tau_{ate}$ under the ignorability-in-mean Assumption ATE.1’ along with
overlap Assumption ATE.2. (Naturally, the weaker assumptions are sufficient for
$\tau_{att}$.) Matching can be motivated by the following thought experiment. Suppose that
we draw a value $x$ from the distribution of covariates in the population. Then, for the
given covariate values, we randomly draw a control unit and a treated unit from the
subpopulation and record the outcomes. The expected difference in the outcomes is

$$E(y | w = 1, x) - E(y | w = 0, x) = \tau_{ate}(x),$$

where the equality holds under Assumption ATE.1’. By iterated expectations, if we
average the difference in outcomes across the distribution of $x$, then we have $\tau_{ate}$.
Matching is just the sample analogue of the thought experiment.

It is clear that lack of overlap will cause problems for matching estimators, just as
with regression adjustment and propensity score weighting. Suppose $i$ is a control
unit and $x_i$ is very far from all the covariate values in the treated subsample. Then the
match used to obtain $\hat y_{i1}$ could be very poor, and averaging several poor matches
need not help. Fundamentally, if there are regions in the support $\mathcal{X}$ of the covariates
without both control and treated units, matching can produce poor results.

The large-sample properties of covariate matching have been obtained by Abadie
and Imbens (2006); see also Imbens and Wooldridge (2009). Unless $K = 1$ (matching
on a single covariate), matching estimators are not $\sqrt{N}$-consistent, and the bias
dominates the variance when $K \geq 3$. See Abadie and Imbens (2006) for bias and
variance calculations, and for a bootstrap procedure for conducting valid inference.

Because of Proposition 21.4, matching on the propensity score also produces consistent
estimators of $\tau_{ate}$ and $\tau_{att}$, which is very convenient because $p(x)$ is a scalar
in the unit interval. If we knew the propensity score, we could apply the results of
Abadie and Imbens (2006); in particular, the estimator would be $\sqrt{N}$-consistent and
variance calculations would be fairly straightforward, as would applying the bootstrap.
Rosenbaum and Rubin (1983) first proposed propensity score matching using
propensity scores obtained from a preliminary logit estimation. Unfortunately, using
an estimated propensity score complicates the statistical properties of the matching
estimator because matching is an inherently discontinuous operation. In particular, if
standard matching methods are used (for example, $h(i)$ is chosen as the match for
observation $i$ if $w_{h(i)} = 1 - w_i$ and $h(i) = \arg\min_h |\hat p(x_h) - \hat p(x_i)|$), then bootstrapping
is no longer justified. (Sometimes a different function of the propensity score, such as
the log-odds ratio, is used in the matching, but that in itself does not fix the problems
with bootstrapping.) Various methods of smoothing propensity score matching have
been proposed; see, for example, Frölich (2004).

Matching methods can also be combined with regression adjustments. See Imbens 
(2004) and Imbens and Wooldridge (2009) for surveys. 


Example 21.2 (Causal Effect of Job Training on Earnings): We now compute 
matching estimates of the causal effects from Example 21.1. We use a single-match 
and diagonal-weighting matrix with the inverse of the variances down the diagonal. 
The standard errors are computed by Stata 10 using the methods in Abadie and 
Imbens (2006). Table 21.2 contains the results. 

Not surprisingly, on the experimental data set, the matching estimates are very
similar to the regression-adjustment and propensity-score-weighting estimates ob- 
tained in Example 21.1. The estimates are somewhat less precise than the regression 
and propensity score estimates. 

Unfortunately, like the other methods, matching does not work very well on the 
nonexperimental data—even when the data set is restricted such that (re74 + re75) /2 
< 15. Further work would need to be done to obtain a sample with better overlap. 


Table 21.2
                      JTRAIN2                               JTRAIN3 (Full Sample)                 JTRAIN3 (Reduced Sample)
Estimation Method     $\hat\tau_{ate}$  $\hat\tau_{att}$    $\hat\tau_{ate}$  $\hat\tau_{att}$    $\hat\tau_{ate}$  $\hat\tau_{att}$
Covariate matching    1.628             1.824               −12.869           0.155               −3.846            −0.232
                      (0.773)           (0.882)             (3.815)           (1.478)             (2.495)           (1.649)
Sample size           445               445                 2,675             2,675               1,162             1,162


21.4 Instrumental Variables Methods


We now turn to instrumental variables estimation of average treatment effects when 
we suspect failure of the ignorability-of-treatment assumption (ATE.1 or ATE.1’). IV 
methods for estimating ATEs can be very effective if a good instrument for treatment 
is available. We need the instrument to predict treatment (after partialing out any 
controls). As we discussed in Section 5.3.1, the instrument should be redundant in a 
certain conditional expectation and unrelated to unobserved heterogeneity; we give 
precise assumptions in the following subsections. 

Our primary focus in this section is on the average treatment effect defined in 
equation (21.1). Section 21.4.1 considers IV estimation, including methods where 
fitted values from estimation of a binary response model for treatment are used as 
instruments—and makes the case for preferring fitted values as instruments rather 
than regressors. Section 21.4.2 provides methods for estimating the ATE that can be 
used when the gain to treatment depends on unobservables (as well as observables). 
Two approaches are suggested; one adds a “correction function” and applies IV 
to the resulting equation. The other is based on a control function approach. In 
Section 21.4.3 we briefly discuss estimating the local average treatment effect by 
instrumental variables when we are not willing to make functional form or dis- 
tributional assumptions. 


21.4.1 Estimating the Average Treatment Effect Using IV


In studying IV procedures, it is useful to write the observed outcome $y$ as

$$y = \mu_0 + (\mu_1 - \mu_0)w + v_0 + w(v_1 - v_0), \tag{21.60}$$

where $\mu_g = E(y_g)$ and $v_g = y_g - \mu_g$, $g = 0,1$. However, unlike in Section 21.3, we do
not assume that $v_0$ and $v_1$ are mean independent of $w$, given $x$. Instead, we assume the
availability of instruments, which we collect in the vector $z$. (Here we separate the
extra instruments from the covariates, so that $x$ and $z$ do not overlap. In many cases $z$
is a scalar, but the analysis is no easier in that case.)

If we assume that the stochastic parts of $y_1$ and $y_0$ are the same, that is, $v_1 = v_0$,
then the interaction term disappears (and $\tau_{ate} = \tau_{att}$). Without the interaction term we
can use standard IV methods under weak assumptions.


ASSUMPTION ATEIV.1: (a) In equation (21.60), $v_1 = v_0$; (b) $L(v_0|x,z) = L(v_0|x)$;
and (c) $L(w|x,z) \neq L(w|x)$.


All linear projections in this chapter contain unity, which we suppress for notational
simplicity.

Under parts a and b of Assumption ATEIV.1, we can write

$$y = \delta_0 + \tau w + x\beta_0 + u_0, \tag{21.61}$$

where $\tau = \tau_{ate}$ and $u_0 = v_0 - L(v_0|x,z)$. By definition, $u_0$ has zero mean and is
uncorrelated with $(x,z)$, but $w$ and $u_0$ are generally correlated, which makes OLS
estimation of equation (21.61) inconsistent. The redundancy of $z$ in the linear projection
$L(v_0|x,z)$ means that $z$ is appropriately excluded from equation (21.61); this
is the part of identification that we cannot test (except indirectly using the
overidentification test from Chapter 6). Part c means that $z$ has predictive power in
the linear projection of treatment on $(x,z)$; this is the standard rank condition for
identification from Chapter 5, and we can test it using a first-stage regression and
heteroskedasticity-robust tests of exclusion restrictions. Under Assumption ATEIV.1,
$\tau$ (and the other parameters in equation (21.61)) are identified, and they can be
consistently estimated by 2SLS. Because the only endogenous explanatory variable in
equation (21.61) is binary, equation (21.60) is called a dummy endogenous variable
model (Heckman, 1978). As we discussed in Chapter 5, there are no special considerations
in estimating equation (21.61) by 2SLS when the endogenous explanatory
variable is binary.

Assumption ATEIV.1b holds if the instruments $z$ are independent of $(y_0, x)$. For
example, suppose $z$ is a scalar determining eligibility in a job training program or
some other social program. Actual participation, $w$, might be correlated with $v_0$,
which could contain unobserved ability. If eligibility is randomly assigned, it is often
reasonable to assume that $z$ is independent of $(y_0, x)$. Eligibility would positively
influence participation, and so Assumption ATEIV.1c should hold.

Random assignment of eligibility is no guarantee that eligibility is a valid instrument
for participation. The outcome of $z$ could affect other behavior, which could
feed back into $u_0$ in equation (21.61). For example, consider Angrist’s (1990) draft
lottery application, where draft lottery number is used as an instrument for enlisting.
Lottery number clearly affected enlistment, so Assumption ATEIV.1c is satisfied.
Assumption ATEIV.1b is also satisfied if men did not change behavior in unobserved
ways that affect wage, based on their lottery number. One concern is that men with
low lottery numbers may get more education as a way of avoiding service through a
deferment. Including years of education in $x$ effectively solves this problem. But what
if men with high draft lottery numbers received more job training because employers
did not fear losing them? If a measure of job training status cannot be included in
$x$, lottery number would generally be correlated with $u_0$. See Angrist, Imbens, and
Rubin (1996) and Heckman (1997) for additional discussion.

As the previous discussion implies, the redundancy condition in Assumption
ATEIV.1b allows the instruments z to be correlated with elements of x. For example, 
in the population of high school graduates, if w is a college degree indicator and the 
instrument z is distance to the nearest college while attending high school, then z is 
allowed to be correlated with other controls in the wage equation, such as geographic 
indicators. 

Under $v_1 = v_0$ and the key assumptions on the instruments, 2SLS on equation
(21.61) is consistent and asymptotically normal. But if we make stronger assump- 
tions, we can find a more efficient IV estimator. 


ASSUMPTION ATEIV.1’: (a) In equation (21.60), $v_1 = v_0$; (b) $E(v_0|x,z) = L(v_0|x)$;
(c) $P(w = 1|x,z) \neq P(w = 1|x)$ and $P(w = 1|x,z) = G(x,z;\gamma)$ is a known parametric
form (usually probit or logit); and (d) $Var(v_0|x,z) = \sigma_0^2$.

Part b assumes that $E(v_0|x)$ is linear in $x$, and so it is more restrictive than Assumption
ATEIV.1b. It does not usually hold for discrete response variables $y$, although it may
be a reasonable approximation in some cases. Under parts a and b, the error $u_0$ in
equation (21.61) has a zero conditional mean:

$$E(u_0|x,z) = 0. \tag{21.62}$$

Part d implies that $Var(u_0|x,z)$ is constant. From the results on efficient choice of
instruments in Section 14.4.3, the optimal IV for $w$ is $E(w|x,z) = G(x,z;\gamma)$. Therefore,
we can use a two-step IV method:


Procedure 21.1 (Under Assumption ATEIV.1’): (a) Estimate the binary response
model $P(w = 1|x,z) = G(x,z;\gamma)$ by maximum likelihood. Obtain the fitted probabilities,
$\hat G_i$. The leading case occurs when $P(w = 1|x,z)$ follows a probit model.

(b) Estimate equation (21.61) by IV using instruments 1, $\hat G_i$, and $x_i$.
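The two steps of Procedure 21.1 can be sketched with NumPy on simulated data. Everything specific below is an assumption for illustration: the data-generating process, the helper names, and the use of a logit rather than a probit first stage (chosen only to keep the code free of extra dependencies). The treatment equation contains a non-logistic composite error, so the first-stage logit is deliberately misspecified, which matters only for efficiency and not for consistency of the IV step.

```python
import numpy as np

def fit_logit(X, w, iters=25):
    """Bernoulli MLE for P(w = 1 | X) = Lambda(X @ g), via Newton's method."""
    g = np.zeros(X.shape[1])
    for _ in range(iters):
        p = 1.0 / (1.0 + np.exp(-(X @ g)))
        score = X.T @ (w - p)
        hessian = (X * (p * (1.0 - p))[:, None]).T @ X
        g = g + np.linalg.solve(hessian, score)
    return g

def iv_estimate(y, X, Z):
    """Just-identified IV estimator: solve (Z'X) b = Z'y."""
    return np.linalg.solve(Z.T @ X, Z.T @ y)

# Simulated data; all parameter values are made up.
rng = np.random.default_rng(1)
N = 50_000
x = rng.normal(size=N)
z = rng.normal(size=N)                       # excluded instrument
u0 = rng.normal(size=N)                      # unobservable in the outcome equation
w = (0.5 * x + 2.0 * z + u0 + rng.logistic(size=N) > 0).astype(float)
tau = 2.0
y = 1.0 + tau * w + 0.5 * x + u0             # equation (21.61) with v1 = v0

ones = np.ones(N)
# Step (a): first-stage binary response model of w on (1, x, z); fitted probabilities.
XZ = np.column_stack([ones, x, z])
G_hat = 1.0 / (1.0 + np.exp(-(XZ @ fit_logit(XZ, w))))
# Step (b): IV estimation of (21.61) with instruments (1, G_hat, x).
X = np.column_stack([ones, w, x])
Z = np.column_stack([ones, G_hat, x])
tau_iv = iv_estimate(y, X, Z)[1]
# OLS for comparison; it is inconsistent because w is correlated with u0.
tau_ols = np.linalg.lstsq(X, y, rcond=None)[0][1]
```

In this design `tau_iv` should be close to 2 while `tau_ols` is biased upward, since units with larger `u0` are more likely to be treated.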


There are several nice features of this IV estimator. First, it can be shown that the
conditions sufficient to ignore the estimation of $\gamma$ in the first stage hold; see Section
6.1.2. Therefore, the usual 2SLS standard errors and test statistics are asymptotically
valid. Second, under Assumption ATEIV.1’, the IV estimator from step b is asymptotically
efficient in the class of estimators where the IVs are functions of $(x_i,z_i)$; see
Problem 8.11. If Assumption ATEIV.1’d does not hold, all statistics should be made
robust to heteroskedasticity, and we no longer have the efficient IV estimator.
Procedure 21.1 has an important robustness property. Because we are using $\hat G_i$ as an
instrument for $w_i$, the model for $P(w = 1|x,z)$ does not have to be correctly specified.
For example, if we specify a probit model for $P(w = 1|x,z)$, we do not need the probit
model to be correct. Generally, what we need is that the linear projection of $w$ onto
$[x, G(x,z;\gamma^*)]$ actually depends on $G(x,z;\gamma^*)$, where we use $\gamma^*$ to denote the plim of
the maximum likelihood estimator when the model is misspecified (see White, 1982a,
and Section 13.11.1). These requirements are fairly weak when $z$ is partially correlated
with $w$.

Technically, $\tau$ and $\beta_0$ are identified even if we do not have extra exogenous variables
excluded from $x$. But we can rarely justify the estimator in this case. For concreteness,
suppose that $w$ given $x$ follows a probit model (and we have no $z$, or $z$ does not
appear in $P(w = 1|x,z)$). Because $G(x,\gamma) = \Phi(\gamma_0 + x\gamma_1)$ is a nonlinear function of $x$,
it is not perfectly correlated with $x$, so it can be used as an IV for $w$. This situation is
very similar to the one discussed in Section 19.6.1: while identification holds for all
values of $\tau$ and $\beta_0$ if $\gamma_1 \neq 0$, we are achieving identification off of the nonlinearity of
$P(w = 1|x)$. Further, $\Phi(\gamma_0 + x\gamma_1)$ and $x$ are typically highly correlated. As we discussed
in Section 5.2.6, severe multicollinearity among the IVs can result in very imprecise
IV estimators. In fact, if $P(w = 1|x)$ followed a linear probability model, $\tau$
would not be identified. See Problem 21.5 for an illustration.


Example 21.3 (Estimating the Effects of Education on Fertility): We use the data in 
FERTIL2.RAW to estimate the effect of attaining at least seven years of education 
on fertility. The data are for women of childbearing age in Botswana. Seven years of 
education is, by far, the modal amount of positive education. (About 21 percent of 
women report zero years of education. For the subsample with positive education, 
about 33 percent report seven years of education.) Let y = children, the number of 
living children, and let w = educ7 be a binary indicator for at least seven years of 
education. The elements of x are age, age², evermarr (ever married), urban (lives in an
urban area), electric (has electricity), and tv (has a television). 

The OLS estimate of $\tau$ is −.394 (se = .050). We also use the variable frsthalf, a
binary variable equal to one if the woman was born in the first half of the year, as an 
IV for educ7. It is easily shown that educ7 and frsthalf are significantly negatively 
related. The usual IV estimate is much larger in magnitude than the OLS estimate, 
but only marginally significant: —1.131 (se = .619). The estimate from Procedure 
21.1 is even bigger in magnitude, and very significant: —1.975 (se = .332). The stan- 
dard error that is robust to arbitrary heteroskedasticity is even smaller. Therefore, 
using the probit fitted values as an IV, rather than the usual linear projection, pro- 
duces a more precise estimate (and one notably larger in magnitude). 

The IV estimate of the education effect seems very large. One possible problem is that,
because children is a nonnegative integer that piles up at zero, the assumptions
underlying Procedure 21.1 (namely, Assumptions ATEIV.1’a and ATEIV.1’b)
might not be met. We could instead apply methods for exponential response functions
in Section 18.5. Both the Terza (1998) and Mullahy (1997) approaches can be derived
using a counterfactual framework.


In principle, it is important to recognize that Procedure 21.1 is not the same as
using $\hat G$ as a regressor in place of $w$. That is, IV estimation of equation (21.61) is not
the same as the OLS estimator from

$$y_i \text{ on } 1, \hat G_i, x_i. \tag{21.63}$$

Consistency of the OLS estimators from regression (21.63) relies on having the model
for $P(w = 1|x,z)$ correctly specified. If the first three parts of Assumption ATEIV.1’
hold, then

$$E(y|x,z) = \delta_0 + \tau G(x,z;\gamma) + x\beta_0,$$

and, from the results on generated regressors in Chapter 6, the estimators from
regression (21.63) are generally consistent. But Procedure 21.1 is more robust because it
does not require Assumption ATEIV.1’c for consistency.

A different way to see the robustness of the IV approach compared with the
regression approach is to think about the underlying first-stage regression in the
population, which is the linear projection of $w$ on $[1, G(x,z;\gamma^*), x]$; write this as
$\eta_0 + \eta_1 G(x,z;\gamma^*) + x\eta_2$. The IV estimator is consistent for any values of the etas
provided $\eta_1 \neq 0$ because we simply need an exogenous variable that moves around
$w$ that is not perfectly correlated with $x$. By contrast, consistency of the two-step
regression estimator for $\tau$ requires $\eta_1 = 1$. If $P(w = 1|x,z) = G(x,z;\gamma^*)$, then $\eta_0 = 0$,
$\eta_1 = 1$, and $\eta_2 = 0$, and so the linear projection of $y$ on $[1, G(x,z;\gamma^*), x]$ is $\delta_0 +
\tau G(x,z;\gamma^*) + x\beta_0$; thus, regression (21.63) consistently estimates all parameters. But
if $G(x,z;\gamma)$ is misspecified, $\eta_1$ can differ from unity (and $\eta_2 \neq 0$). Generally, the
regression (21.63) consistently estimates $\delta_0 + \tau\eta_0$, $\tau\eta_1$, and $\beta_0 + \tau\eta_2$ as the intercept,
coefficient on $\hat G_i$, and coefficients on $x_i$, respectively.
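The probability limits just stated follow from a short substitution. The lines below are a sketch of that step, assuming $E(u_0|x,z) = 0$ as in (21.62); the shorthand $G^* \equiv G(x,z;\gamma^*)$ and the symbol $r$ for the first-stage projection error are introduced here for convenience and are not notation from the text.

```latex
\begin{align*}
w &= \eta_0 + \eta_1 G^* + x\eta_2 + r,
   \qquad E(r) = 0,\quad E(G^* r) = 0,\quad E(x' r) = 0, \\
y &= \delta_0 + \tau w + x\beta_0 + u_0 \\
  &= (\delta_0 + \tau\eta_0) + (\tau\eta_1)\, G^* + x(\beta_0 + \tau\eta_2) + (u_0 + \tau r).
\end{align*}
```

Because $u_0$ is mean independent of $(x,z)$ and $r$ is a projection error, the composite error $u_0 + \tau r$ is uncorrelated with $(1, G^*, x)$, so OLS converges to the coefficients in the last line. These equal $(\delta_0, \tau, \beta_0)$ only when $\eta_0 = 0$, $\eta_1 = 1$, and $\eta_2 = 0$. The IV estimator instead relies only on $E[(1, G^*, x)' u_0] = 0$ together with $\eta_1 \neq 0$.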

Another problem with regression (21.63) is that the usual OLS standard errors
and test statistics are not valid, for two reasons. First, if $Var(u_0|x,z)$ is constant,
$Var(y|x,z)$ cannot be constant because $Var(w|x,z)$ is not constant. By itself this
is a minor nuisance because heteroskedasticity-robust standard errors and test statistics
are easy to obtain. (However, it does call into question the efficiency of the
estimator from regression (21.63).) A more serious problem is that the asymptotic
variance of the estimator from regression (21.63) depends on the asymptotic variance
of $\hat\gamma$ unless $\tau = 0$, and the heteroskedasticity-robust standard errors do not correct
for this.

In summary, using fitted probabilities from a first-stage binary response model,
such as probit or logit, as an instrument for $w$ is a nice way to exploit the binary nature
of the endogenous explanatory variable. In addition, the asymptotic inference is
always standard. Using $\hat G_i$ as an instrument does require the assumption that
$E(v_0|x,z)$ depends only on $x$ and is linear in $x$, which can be more restrictive than
Assumption ATEIV.1b.

Allowing for the interaction $w(v_1 - v_0)$ in equation (21.60) is notably harder. In
general, when $v_1 \neq v_0$, the IV estimator (using $z$ or $\hat G$ as IVs for $w$) does not
consistently estimate $\tau_{ate}$ (or $\tau_{att}$). Nevertheless, it is useful to find assumptions under
which IV estimation does consistently estimate ATE. This problem has been studied
by Angrist (1991), Heckman (1997), and Wooldridge (1997b, 2003b), and we
synthesize results from these papers.

Under the conditional mean redundancy assumptions

$$E(v_0|x,z) = E(v_0|x) \quad\text{and}\quad E(v_1|x,z) = E(v_1|x), \tag{21.64}$$

we can always write equation (21.60) as

$$y = \mu_0 + \tau w + g_0(x) + w[g_1(x) - g_0(x)] + e_0 + w(e_1 - e_0), \tag{21.65}$$

where $\tau = \tau_{ate}$ and

$$v_0 = g_0(x) + e_0, \quad E(e_0|x,z) = 0, \tag{21.66}$$

$$v_1 = g_1(x) + e_1, \quad E(e_1|x,z) = 0. \tag{21.67}$$

Given functional form assumptions for $g_0$ and $g_1$, which would typically be linear
in parameters, we can estimate equation (21.65) by IV, where the error term is
$e_0 + w(e_1 - e_0)$. For concreteness, suppose that


$$g_0(x) = \eta_0 + x\beta_0, \qquad g_1(x) - g_0(x) = (x - \psi)\delta, \tag{21.68}$$


where $\psi = E(x)$. If we plug these equations into equation (21.65), we need instruments
for $w$ and $w(x - \psi)$ (note that $x$ does not contain a constant here). If
$q \equiv q(x,z)$ is the instrument for $w$ (such as the response probability in Procedure
21.1), the natural instrument for $w \cdot x$ is $q \cdot x$. (And, if $q$ is the efficient IV for $w$, $q \cdot x$
is the efficient instrument for $w \cdot x$.) When will applying IV to

$$y = \gamma + \tau w + x\beta_0 + w(x - \psi)\delta + e_0 + w(e_1 - e_0) \tag{21.69}$$

be consistent? If the last term disappears, and, in particular, if

$$e_1 = e_0, \tag{21.70}$$

then the error $e_0$ has zero mean given $(x,z)$; this result means that IV estimation of
equation (21.69) produces consistent, asymptotically normal estimators.


ASSUMPTION ATEIV.2: With $y$ expressed as in equation (21.60), conditions (21.64),
(21.68), and (21.70) hold. In addition, Assumption ATEIV.1’c holds.


We have the following extension of Procedure 21.1: 


Procedure 21.2 (Under Assumption ATEIV.2): (a) Same as Procedure 21.1.
(b) Estimate the equation

$$y_i = \gamma + \tau w_i + x_i\beta_0 + [w_i(x_i - \bar x)]\delta + error_i \tag{21.71}$$

by IV, using instruments 1, $\hat G_i$, $x_i$, and $\hat G_i(x_i - \bar x)$.


If we add Assumption ATEIV.1’d, Procedure 21.2 produces the efficient IV estimator
(when we ignore estimation of $E(x)$). As with Procedure 21.1, we do not actually
need the binary response model to be correctly specified for identification. As an
alternative, we can use $z_i$ and interactions between $z_i$ and $x_i$ as instruments, which
generally results in testable overidentifying restrictions.

Technically, the fact that x is an estimator of E(x) should be accounted for in 
computing the standard errors of the IV estimators. But, as shown in Problem 6.10, 
the adjustments for estimating E(x) often will have a trivial effect on the standard 
errors; in practice, we can just use the usual or heteroskedasticity-robust standard 
errors. Alternatively, we can apply the bootstrap. 
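
As an illustration of Procedure 21.2, the following is a minimal simulation sketch, not a definitive implementation. It uses the alternative instrument set mentioned above ($z_i$ and its interaction with the demeaned covariate) rather than probit fitted values, so that only the standard library is needed. The data-generating process, the parameter values, and the helper names `solve`, `iv_just_identified`, and `simulate` are all assumptions made for the example; condition (21.70), $e_1 = e_0$, is imposed by construction.

```python
import random

def solve(A, b):
    """Solve the square system A t = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[p] = M[p], M[c]
        for r in range(c + 1, n):
            f = M[r][c] / M[c][c]
            for k in range(c, n + 1):
                M[r][k] -= f * M[c][k]
    t = [0.0] * n
    for r in reversed(range(n)):
        t[r] = (M[r][n] - sum(M[r][k] * t[k] for k in range(r + 1, n))) / M[r][r]
    return t

def iv_just_identified(X, Z, y):
    """Just-identified IV estimator: solve (Z'X) b = Z'y."""
    k = len(X[0])
    ZX = [[sum(zr[i] * xr[j] for zr, xr in zip(Z, X)) for j in range(k)] for i in range(k)]
    Zy = [sum(zr[i] * yi for zr, yi in zip(Z, y)) for i in range(k)]
    return solve(ZX, Zy)

def simulate(n, tau=2.0, delta=0.7, seed=0):
    """Hypothetical design: scalar x with E(x) = 1, binary instrument z, e1 = e0."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        x = rng.gauss(1.0, 1.0)
        z = 1.0 if rng.random() < 0.5 else 0.0
        e0 = rng.gauss(0.0, 1.0)
        a = 0.6 * e0 + 0.8 * rng.gauss(0.0, 1.0)  # treatment error correlated with e0
        w = 1.0 if -0.5 + 0.3 * x + 1.0 * z + a >= 0 else 0.0
        y = 1.0 + 0.5 * x + e0 + w * (tau + (x - 1.0) * delta)
        rows.append((y, w, x, z))
    return rows

rows = simulate(20000)
xbar = sum(r[2] for r in rows) / len(rows)
y = [r[0] for r in rows]
X = [[1.0, w, x, w * (x - xbar)] for _, w, x, _ in rows]  # regressors in (21.71)
Z = [[1.0, z, x, z * (x - xbar)] for _, _, x, z in rows]  # instruments
gamma_hat, tau_hat, beta_hat, delta_hat = iv_just_identified(X, Z, y)
print("tau_hat:", tau_hat, "delta_hat:", delta_hat)
```

In this design the coefficient on $w_i$ should be close to the assumed ATE of 2 because the interaction is built from the demeaned covariate; with an undemeaned interaction the coefficient on $w_i$ would instead estimate the treatment effect at $x = 0$.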


Example 21.4 (An IV Approach to Evaluating Job Training): To evaluate the effects
of a job training program on subsequent wages, suppose that $x$ includes education,
experience, and the square of experience. If $z$ indicates eligibility in the program, we
would estimate the equation


$$\begin{aligned}
\log(wage) = {} & \mu_0 + \tau\, jobtrain + \beta_{01} educ + \beta_{02} exper + \beta_{03} exper^2 \\
& + \delta_1\, jobtrain \cdot (educ - \overline{educ}) + \delta_2\, jobtrain \cdot (exper - \overline{exper}) \\
& + \delta_3\, jobtrain \cdot (exper^2 - \overline{exper^2}) + error
\end{aligned}$$


by IV, using instruments $1$, $z$, $educ$, $exper$, $exper^2$, and interactions of $z$ with all
demeaned covariates. Notice that for the last interaction, we subtract off the average
of $exper^2$. Alternatively, we could use in place of $z$ the fitted values from a probit of
$jobtrain$ on $(x, z)$.


Procedure 21.2 is easy to carry out, but its consistency generally hinges on condition
(21.70), not to mention the functional form assumptions in equation (21.68). We
can relax condition (21.70) to

$$E[w(e_1 - e_0) \mid x, z] = E[w(e_1 - e_0)]. \tag{21.72}$$


We do not need $w(e_1 - e_0)$ to have zero mean, as a nonzero mean only affects the
intercept. It is important to see that correlation between $w$ and $(e_1 - e_0)$ does not
invalidate the IV estimator of $\tau$ from Procedure 21.2. However, we must assume that
the covariance conditional on $(x, z)$ is constant. Even if this assumption is not exactly
true, it might be approximately true.

It is easy to see why, along with conditions (21.64) and (21.68), condition (21.72)
implies consistency of the IV estimator. We can write equation (21.69) as


$$y = \zeta + \tau w + x\beta_0 + w(x - \psi)\delta + e_0 + r, \tag{21.73}$$


where $r = w(e_1 - e_0) - E[w(e_1 - e_0)]$ and $\zeta = \gamma + E[w(e_1 - e_0)]$. Under condition
(21.72), $E(r \mid x,z) = 0$, and so the composite error $e_0 + r$ has zero mean conditional
on $(x,z)$. Therefore, any function of $(x,z)$ can be used as instruments in equation
(21.73). Under the following modification of Assumption ATEIV.2, Procedure 21.2 is
still consistent:


ASSUMPTION ATEIV.2’: With y expressed as in equation (21.60), conditions (21.64), 
(21.68), and (21.72) hold. In addition, Assumption ATEIV.1'c holds. 


Even if Assumption ATEIV.1'd holds in addition to Assumption ATEIV.1'c, the IV
estimator is generally not efficient because $\mathrm{Var}(r \mid x,z)$ is typically nonconstant;
that is, $r$ is typically heteroskedastic.

Angrist (1991) provided primitive conditions for assumption (21.72) in the case
where $z$ is independent of $(v_0, v_1, x)$. Then, the covariates can be dropped entirely
from the analysis (leading to IV estimation of the simple regression equation
$y = \zeta + \tau w + error$). We can extend those conditions here to allow $z$ and $x$ to be
correlated. Assume that


$$E(w \mid x, z, e_1 - e_0) = h(x,z) + k(e_1 - e_0) \tag{21.74}$$

for some functions $h(\cdot)$ and $k(\cdot)$ and that

$$e_1 - e_0 \text{ is independent of } (x, z). \tag{21.75}$$

Under these two assumptions,

$$\begin{aligned}
E[w(e_1 - e_0) \mid x, z] &= h(x,z)E(e_1 - e_0 \mid x, z) + E[(e_1 - e_0)k(e_1 - e_0) \mid x, z] \\
&= h(x,z) \cdot 0 + E[(e_1 - e_0)k(e_1 - e_0)] \\
&= E[(e_1 - e_0)k(e_1 - e_0)],
\end{aligned} \tag{21.76}$$

which is just an unconditional moment in the distribution of $e_1 - e_0$. We have used
the fact that $E(e_1 - e_0 \mid x, z) = 0$ and that any function of $e_1 - e_0$ is independent of
$(x,z)$ under assumption (21.75). If we assume that $k(\cdot)$ is the identity function (as in
Wooldridge, 1997b), then equation (21.76) is $\mathrm{Var}(e_1 - e_0)$.

Assumption (21.75) is reasonable for continuously distributed responses, but it
would not generally be reasonable when $y$ is a discrete response or corner solution
outcome. Further, even if assumption (21.75) holds, assumption (21.74) is violated
when $w$ given $x$, $z$, and $(e_1 - e_0)$ follows a standard binary response model. For
example, a probit model would have


$$P(w = 1 \mid x, z, e_1 - e_0) = \Phi[\pi_0 + x\pi_1 + z\pi_2 + \rho(e_1 - e_0)], \tag{21.77}$$


which is not separable in $(x, z)$ and $(e_1 - e_0)$. Nevertheless, assumption (21.74) might
be a reasonable approximation in some cases. Without covariates, Angrist (1991)
presents simulation evidence that suggests the simple IV estimator does quite well for
estimating the ATE even when assumption (21.74) is violated.


21.4.2 Correction and Control Function Approaches 


Rather than assuming that the presence of $w(e_1 - e_0)$ in the error term does not
cause inconsistency for IV estimators, we can use functional form and distributional
assumptions to directly account for this term. The first approach we consider,
proposed by Wooldridge (2008), involves adding a correction function, which is a
function of the exogenous variables $(x,z)$, to equation (21.73), and then applying
instrumental variables to account for the endogeneity of $w$ and $w(x - \psi)$. To derive
the correction function, we add to assumptions (21.75) and (21.77) a normality
assumption. Let $c = e_1 - e_0$ and assume


$$c \sim \mathrm{Normal}(0, \omega^2). \tag{21.78}$$


Under assumptions (21.75), (21.77), and (21.78) we can derive an estimating equation
to show that $\tau_{ate}$ is usually identified.

To derive an estimating equation, note that conditions (21.75), (21.77), and (21.78)
imply that


$$P(w = 1 \mid x,z) = \Phi(\theta_0 + x\theta_1 + z\theta_2), \tag{21.79}$$


where each theta is the corresponding pi multiplied by $[1 + \rho^2\omega^2]^{-1/2}$. If we let $a$
denote the latent error underlying equation (21.79) (with a standard normal
distribution), then conditions (21.75), (21.77), and (21.78) imply that $(a,c)$ has a zero-mean
bivariate normal distribution that is independent of $(x,z)$. Therefore,
$E(c \mid a,x,z) = E(c \mid a) = \xi a$ for some parameter $\xi$, and

$$E(wc \mid x,z) = E[wE(c \mid a,x,z) \mid x,z] = \xi E(wa \mid x,z).$$


Using the fact that $a \sim \mathrm{Normal}(0, 1)$ and is independent of $(x,z)$, we have

$$\begin{aligned}
E(wa \mid x,z) &= \int_{-\infty}^{\infty} 1[\theta_0 + x\theta_1 + z\theta_2 + a \geq 0]\, a\, \phi(a)\, da \\
&= \phi(-\{\theta_0 + x\theta_1 + z\theta_2\}) = \phi(\theta_0 + x\theta_1 + z\theta_2),
\end{aligned} \tag{21.80}$$

where $\phi(\cdot)$ is the standard normal density. Therefore, we can now write

$$y = \gamma + \tau w + x\beta_0 + w(x - \psi)\delta + \xi\phi(\theta_0 + x\theta_1 + z\theta_2) + e_0 + r, \tag{21.81}$$


where $r = wc - E(wc \mid x,z)$. The composite error in equation (21.81) has zero mean
conditional on $(x,z)$, and so we can estimate the parameters using IV methods. One
catch is the nonlinear function $\phi(\theta_0 + x\theta_1 + z\theta_2)$. We could use nonlinear two-stage
least squares, as described in Chapter 14. But a two-step approach is easier. First, we
gather together the assumptions:


ASSUMPTION ATEIV.3: With $y$ written as in equation (21.70), maintain assumptions
(21.64), (21.68), (21.75), (21.77) (with $\pi_2 \neq 0$), and (21.78).


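Procedure 21.3 (Under Assumption ATEIV.3): (a) Estimate $\theta_0$, $\theta_1$, and $\theta_2$ from
a probit of $w$ on $(1,x,z)$. Form the predicted probabilities, $\hat{\Phi}_i$, along with
$\hat{\phi}_i = \phi(\hat{\theta}_0 + x_i\hat{\theta}_1 + z_i\hat{\theta}_2)$, $i = 1, 2, \ldots, N$.

(b) Estimate the equation


$$y_i = \gamma + \tau w_i + x_i\beta_0 + w_i(x_i - \bar{x})\delta + \xi\hat{\phi}_i + error_i \tag{21.82}$$

by IV, using instruments $[1, \hat{\Phi}_i, x_i, \hat{\Phi}_i(x_i - \bar{x}), \hat{\phi}_i]$.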
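
The correction function rests on the identity in (21.80), $E(wa \mid x,z) = \phi(\theta_0 + x\theta_1 + z\theta_2)$. The sketch below checks that identity by Monte Carlo for a few fixed values of the index $q = \theta_0 + x\theta_1 + z\theta_2$. The function names and the chosen index values are illustrative assumptions, not part of the procedure.

```python
import math
import random

def normal_pdf(q):
    """Standard normal density phi(q)."""
    return math.exp(-0.5 * q * q) / math.sqrt(2.0 * math.pi)

def mc_E_wa(q, n=200000, seed=1):
    """Monte Carlo estimate of E(w*a) with w = 1[q + a >= 0] and a ~ Normal(0, 1)."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n):
        a = rng.gauss(0.0, 1.0)
        if q + a >= 0:
            total += a
    return total / n

# Each tuple holds (index value, simulated E(w*a), phi(index)).
checks = [(q, mc_E_wa(q), normal_pdf(q)) for q in (-1.0, 0.0, 0.5)]
for q, simulated, exact in checks:
    print(q, simulated, exact)
```

The simulated mean and the density should agree up to simulation error at every index value, including negative ones, which reflects the symmetry $\phi(-q) = \phi(q)$ used in the last equality of (21.80).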


Wooldridge (2008) calls the extra term $\hat{\phi}_i = \phi(\hat{\theta}_0 + x_i\hat{\theta}_1 + z_i\hat{\theta}_2)$ a “correction
function” to distinguish it from the more common “control function,” which we turn
to shortly. Unlike with the control function approach, adding $\phi(\theta_0 + x\theta_1 + z\theta_2)$
does not render $w$ or $w(x - \psi)$ exogenous in equation (21.81). Rather, it ensures that
the composite error, $e_0 + r$, is mean independent of $(x,z)$ when $w(e_1 - e_0)$ is present
and not assumed independent of $(x,z)$. Conveniently, IV estimation of equation
(21.82) leads to a simple test of $H_0\colon \xi = 0$. Under the null, the coefficient on the
generated regressor, $\hat{\phi}_i$, is zero. Further, that the IVs are estimated in a first stage does
not affect the $\sqrt{N}$-asymptotic distribution of the IV estimators; see Section 6.1.3.
Therefore, if we ignore the estimation error in $\bar{x}$, we can use a standard
heteroskedasticity-robust $t$ statistic on $\hat{\phi}_i$ to test whether the correction function is needed.
(Note that $r = wc - E(wc \mid x,z)$ can be homoskedastic under the null but there is
never any harm in making the test robust to heteroskedasticity.) Notice that the only
place we used normality of $(c,a)$ is in deriving the correction function. Because $\xi = 0$
under the null, this normality assumption is not needed for the test. (Remember, our
use of probit-fitted values as instruments for $w$ does not hinge on the probit model for
$w$ being true; it motivates the choice of IVs, but the probit model can be arbitrarily
misspecified.)

If we allow the possibility that $\xi \neq 0$, all standard errors need to be adjusted for the
two-step estimation. One can use the delta method or express the two-step estimation
as a generalized method of moments problem as in Chapter 14. (Either approach can
ignore or account for sampling error in $\bar{x}$.) Because the two-step estimation is
computationally simple, the bootstrap is attractive as a way to account for all sampling
error in $\hat{\tau}$.

Even if $\xi \neq 0$, adding the correction function $\hat{\phi}_i$ to the equation need not have
much effect on the estimate of $\tau$. To see why, consider the version of equation (21.81)
when we have no covariates $x$ and with a scalar IV, $z$:


$$y = \gamma + \tau w + \xi\phi(\theta_0 + \theta_1 z) + u, \qquad E(u \mid z) = 0. \tag{21.83}$$


This equation holds, for example, if the instrument $z$ is independent of $(v_0, v_1)$. The
simple IV estimator of $\tau$ is obtained by omitting $\phi(\theta_0 + \theta_1 z)$. If we use $z$ as an IV for
$w$, the simple IV estimator is consistent provided $z$ and $\phi(\theta_0 + \theta_1 z)$ are uncorrelated.
(Remember, having an omitted variable that is uncorrelated with the IV does not
cause inconsistency of the IV estimator.) Even though $\phi(\theta_0 + \theta_1 z)$ is a function of
$z$, these two variables might have small correlation because $z$ is monotonic while
$\phi(\theta_0 + \theta_1 z)$ is symmetric about $-(\theta_0/\theta_1)$. This discussion shows that condition
(21.72) is not necessary for IV to consistently estimate the ATE: It could be that while
$E[w(e_1 - e_0) \mid x, z]$ is not constant, it is roughly uncorrelated with $x$ (or the functions
of $x$) that appear in equation (21.73), as well as with the functions of $z$ used as
instruments.
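
To make the omitted-variable argument concrete, the probability limit of the simple IV estimator in the no-covariate case can be written out directly. This is a short worked step under the assumptions of equation (21.83), with $z$ as the single instrument and $\mathrm{Cov}(z, w) \neq 0$; it uses only $\mathrm{Cov}(z, u) = 0$, which follows from $E(u \mid z) = 0$:

$$\operatorname{plim} \hat{\tau}_{IV} = \frac{\mathrm{Cov}(z, y)}{\mathrm{Cov}(z, w)} = \tau + \xi\, \frac{\mathrm{Cov}[z, \phi(\theta_0 + \theta_1 z)]}{\mathrm{Cov}(z, w)}.$$

The inconsistency is therefore proportional to both $\xi$ and the covariance between the instrument and the omitted correction function, and it vanishes when either is zero.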

Equation (21.83) illustrates another important point: If $\xi \neq 0$ and the single
instrument $z$ is binary, $\tau$ is not identified. Lack of identification occurs because
$\phi(\theta_0 + \theta_1 z)$ takes on only two values, which means it is perfectly linearly related to $z$.
So long as $z$ takes on more than two values, $\tau$ is generally identified, although the
identification is due to the fact that $\phi(\cdot)$ is a different nonlinear function than $\Phi(\cdot)$.
With $x$ in the model, $\hat{\phi}_i$ and $\hat{\Phi}_i$ might be collinear, resulting in imprecise IV estimates.
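
The identification failure with a binary instrument can be seen numerically. The sketch below, with hypothetical index parameters `theta0` and `theta1`, draws the line through the first two support points of $z$ and reports how far $\phi(\theta_0 + \theta_1 z)$ lies from that line over the whole support. With two support points the gap is zero, so the correction function is an exact linear function of $z$; with three it is not.

```python
import math

def normal_pdf(q):
    """Standard normal density phi(q)."""
    return math.exp(-0.5 * q * q) / math.sqrt(2.0 * math.pi)

def max_residual(support, theta0, theta1):
    """Largest deviation of phi(theta0 + theta1*z) from the line through the
    first two support points, evaluated over every point in the support."""
    f = [normal_pdf(theta0 + theta1 * z) for z in support]
    z0, z1 = support[0], support[1]
    slope = (f[1] - f[0]) / (z1 - z0)
    intercept = f[0] - slope * z0
    return max(abs(fi - (intercept + slope * z)) for fi, z in zip(f, support))

binary_gap = max_residual([0, 1], theta0=0.2, theta1=0.8)
three_point_gap = max_residual([0, 1, 2], theta0=0.2, theta1=0.8)
print("binary z:", binary_gap)
print("three-valued z:", three_point_gap)
```

With a three-valued instrument the deviation is nonzero but can be small, which is consistent with the remark that identification then relies on the nonlinearity of $\phi(\cdot)$ and may be weak in practice.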

Because $r$ in equation (21.81) is heteroskedastic, the instruments below equation
(21.82) are not optimal, and so we might simply use $z_i$ along with interactions of $z_i$
with $(x_i - \bar{x})$ and $\hat{\phi}_i$ as IVs. If $z_i$ has dimension greater than one, then we can test the
overidentifying restrictions as a partial test of instrument selection and the normality
assumptions. Of course, we could use the results of Chapter 14 to characterize and
estimate the optimal instruments, but this approach is fairly involved [see, for
example, Newey and McFadden (1994)].

We can use a similar set of assumptions to derive a control function (CF) approach
for estimating $\tau = \tau_{ate}$. Recall that the CF approach involves finding $E(y \mid w,x,z)$
and then using regression methods. (Or, a maximum likelihood approach is often
available under a stronger set of assumptions.) Typically, the CF approach is derived
in the context of the endogenous switching regression model, but it is easily seen that
the treatment effect model with heterogeneous treatment, where the treatment is
correlated with unobservables even after conditioning on observables, fits that bill
when the treatment is defined to be the binary switching variable. In particular,
equation (21.69) results by writing $y = (1 - w)y_0 + wy_1$ and imposing the linear,
additive structures on $y_0$ and $y_1$:


$$\begin{aligned}
y &= (1 - w)(\alpha_0 + x\beta_0 + e_0) + w(\alpha_1 + x\beta_1 + e_1) \\
&= \alpha_0 + x\beta_0 + (\alpha_1 - \alpha_0)w + wx(\beta_1 - \beta_0) + e_0 + w(e_1 - e_0) \\
&= \gamma + x\beta_0 + \tau w + w(x - \psi)\delta + e_0 + w(e_1 - e_0),
\end{aligned}$$


where $\tau = \tau_{ate} = E(y_1 - y_0)$ and $\psi = E(x)$. This particular way of expressing $y$ in
terms of the treatment, covariates, and unobservables emphasizes that we are
primarily interested in $\tau$, although $\tau_{ate}(x) = \tau + (x - \psi)\delta$ is of interest for studying how
the average treatment effect changes as a function of observables.

We can derive a control function method under the following assumption. 


ASSUMPTION ATEIV.4: With $y$ written as in equation (21.70), maintain assumptions
(21.64) and (21.68). Furthermore, the treatment can be written as
$w = 1[\theta_0 + x\theta_1 + z\theta_2 + a \geq 0]$, where $(a, e_0, e_1)$ is independent of $(x,z)$ with a
trivariate normal distribution; in particular, $a \sim \mathrm{Normal}(0, 1)$.


Under Assumption ATEIV.4, we can use calculations very similar to those used in
Section 19.6.1 to obtain $E(y \mid w,x,z)$. In particular,


$$\begin{aligned}
E(y \mid w,x,z) = {} & \gamma + \tau w + x\beta_0 + w(x - \psi)\delta + \rho_1 w[\phi(q\theta)/\Phi(q\theta)] \\
& + \rho_2(1 - w)\{\phi(q\theta)/[1 - \Phi(q\theta)]\},
\end{aligned} \tag{21.84}$$


where $q\theta = \theta_0 + x\theta_1 + z\theta_2$ and $\rho_1$ and $\rho_2$ are additional parameters. Heckman (1978)
used this expectation to obtain two-step estimators of the switching regression model.
(See Vella and Verbeek (1999) for a recent discussion of the switching regression
model in the context of treatment effects.) Not surprisingly, equation (21.84) suggests
a simple two-step procedure, where the first step is identical to that in Procedure
21.3:


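Procedure 21.4 (Under Assumption ATEIV.4): (a) Estimate $\theta_0$, $\theta_1$, and $\theta_2$ from a
probit of $w$ on $(1,x,z)$. Form the predicted probabilities, $\hat{\Phi}_i$, along with
$\hat{\phi}_i = \phi(\hat{\theta}_0 + x_i\hat{\theta}_1 + z_i\hat{\theta}_2)$, $i = 1, 2, \ldots, N$.

(b) Run the OLS regression


$$y_i \text{ on } 1,\ w_i,\ x_i,\ w_i(x_i - \bar{x}),\ w_i(\hat{\phi}_i/\hat{\Phi}_i),\ (1 - w_i)[\hat{\phi}_i/(1 - \hat{\Phi}_i)] \tag{21.85}$$


using all of the observations. The coefficient on $w_i$ is a consistent estimator of $\tau$, the
ATE.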


When we restrict attention to the $w_i = 1$ subsample, thereby dropping $w_i$ and
$w_i(x_i - \bar{x})$, we obtain the sample selection correction from Section 19.6.1. (The
treatment $w_i$ becomes the sample selection indicator.) But the goal of sample
selection corrections is very different from estimating an average treatment effect. For the
sample selection problem, the goal is to estimate $\beta_0$, which indexes $E(y \mid x)$ in the
population. By contrast, in estimating an ATE we are interested in the causal effect
that $w$ has on $y$.

It makes sense to check for joint significance of the last two regressors in regression
(21.85) as a test of endogeneity of $w$. Because the coefficients $\rho_1$ and $\rho_2$ are zero under
$H_0$, we can use the results from Chapter 6 to justify the usual Wald test (perhaps
made robust to heteroskedasticity). If these terms are jointly insignificant at a
sufficiently high level, we can justify the usual OLS regression without unobserved
heterogeneity. If we reject $H_0$, we must deal with the generated regressors problem in
obtaining a valid standard error for $\hat{\tau}$.

Technically, Procedure 21.3 is more robust than Procedure 21.4 because the former 
does not require a trivariate normality assumption. Linear conditional expectations, 
along with the assumption that w given (x,z) follows a probit, suffice. In addition, 
Procedure 21.3 allows us to separate the issues of endogeneity of w and nonconstant 
treatment effect. 

Practically, the extra assumption in Procedure 21.4 is that $e_0$ is independent of
$(x,z)$ with a normal distribution. We may be willing to make this assumption,
especially if the estimates from Procedure 21.3 are too imprecise to be useful. The efficiency
issue is a difficult one because of the two-step estimation involved, but, intuitively,
Procedure 21.4 is likely to be more efficient because it is based on $E(y \mid w,x,z)$.
Procedure 21.3 involves replacing the unobserved composite error with its expectation
conditional only on $(x,z)$. In at least one case, Procedure 21.4 gives results when
Procedure 21.3 cannot: when $x$ is not in the equation and there is a single binary
instrument.
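
The single-binary-instrument case without covariates is simple enough to sketch with the standard library alone. In that case the first-stage probit of $w$ on $(1, z)$ is saturated, so the fitted probabilities are the sample treatment rates by $z$, and regression (21.85) has four coefficients and four $(w, z)$ cells, so OLS fits the cell means exactly and the coefficients have a closed form. The simulation design, the parameter values, and the helper names below are assumptions made for illustration only; the errors $(a, e_0, e_1)$ are generated as jointly normal, in line with Assumption ATEIV.4.

```python
import random
from statistics import NormalDist

STD = NormalDist()  # standard normal distribution

def simulate(n, tau=2.0, seed=0):
    """Hypothetical design: binary z, no x, (a, e0, e1) jointly normal."""
    rng = random.Random(seed)
    data = []
    for _ in range(n):
        z = 1 if rng.random() < 0.5 else 0
        a = rng.gauss(0.0, 1.0)
        e0 = 0.5 * a + (1 - 0.5 ** 2) ** 0.5 * rng.gauss(0.0, 1.0)
        e1 = 0.3 * a + (1 - 0.3 ** 2) ** 0.5 * rng.gauss(0.0, 1.0)
        w = 1 if -0.5 + 1.5 * z + a >= 0 else 0
        y = (1.0 + tau + e1) if w == 1 else (1.0 + e0)
        data.append((y, w, z))
    return data

def cell_mean(data, w, z):
    """Sample mean of y in the cell with treatment status w and instrument value z."""
    ys = [y for y, wi, zi in data if wi == w and zi == z]
    return sum(ys) / len(ys)

def control_function_ate(data):
    # Step (a): saturated first stage, so the index is Phi^{-1}(treatment rate by z).
    q = {}
    for z in (0, 1):
        ws = [wi for _, wi, zi in data if zi == z]
        q[z] = STD.inv_cdf(sum(ws) / len(ws))
    lam1 = {z: STD.pdf(q[z]) / STD.cdf(q[z]) for z in (0, 1)}          # phi/Phi
    lam0 = {z: STD.pdf(q[z]) / (1.0 - STD.cdf(q[z])) for z in (0, 1)}  # phi/(1 - Phi)
    # Step (b): exact-fit solution of regression (21.85) on the four cell means.
    rho1 = (cell_mean(data, 1, 1) - cell_mean(data, 1, 0)) / (lam1[1] - lam1[0])
    rho2 = (cell_mean(data, 0, 1) - cell_mean(data, 0, 0)) / (lam0[1] - lam0[0])
    treated_intercept = cell_mean(data, 1, 1) - rho1 * lam1[1]  # gamma + tau
    control_intercept = cell_mean(data, 0, 1) - rho2 * lam0[1]  # gamma
    return treated_intercept - control_intercept

data = simulate(60000)
tau_cf = control_function_ate(data)
treated = [y for y, w, _ in data if w == 1]
controls = [y for y, w, _ in data if w == 0]
naive = sum(treated) / len(treated) - sum(controls) / len(controls)
print("control function estimate:", tau_cf)
print("naive difference in means:", naive)
```

In this design the naive difference in means overstates the assumed ATE of 2 because treatment is positively correlated with both unobservables, while the control function estimate should be close to 2. The estimate leans entirely on the normality assumption, since with two instrument values only the functional form of the inverse Mills ratio terms separates the intercepts from the selection effects.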

So far we have focused on estimating $\tau_{ate}$. Under a variant of Assumption
ATEIV.2, we can consistently estimate $\tau_{att}$ by IV. As before, we express $y$ as in
equation (21.60). First, we show how to consistently estimate $\tau_{att}(x)$, which can be
written as


$$\tau_{att}(x) = E(y_1 - y_0 \mid x, w = 1) = (\mu_1 - \mu_0) + E(v_1 - v_0 \mid x, w = 1).$$

The following assumption identifies $\tau_{att}(x)$:


ASSUMPTION ATTIV.1: (a) With $y$ expressed as in equation (21.60), the first part of
assumption (21.64) holds, that is, $E(v_0 \mid x,z) = E(v_0 \mid x)$; (b)
$E(v_1 - v_0 \mid x,z,w = 1) = E(v_1 - v_0 \mid x, w = 1)$; and (c) Assumption ATEIV.1'c holds.


We discussed part a of this assumption earlier, as it also appears in Assumption
ATEIV.2'. It can be violated if agents change their behavior based on $z$. Part b
deserves some discussion. Recall that $v_1 - v_0$ is the person-specific gain from
participation or treatment. Assumption ATTIV.1 requires that for those in the treatment
group, the gain is not predictable given $z$, once $x$ is controlled for. Heckman (1997)
discusses Angrist's (1990) draft lottery example, where $z$ (a scalar) is draft lottery
number. Men who had a large $z$ were virtually certain to escape the draft. But some
men with large draft numbers chose to serve anyway. Even with good controls in $x$, it
seems plausible that, for those who chose to serve, a higher $z$ is associated with a
higher gain to military service. In other words, for those who chose to serve, $v_1 - v_0$
and $z$ are positively correlated, even after controlling for $x$. This argument directly
applies to estimation of $\tau_{att}$; the effect on estimation of $\tau_{ate}$ is less clear.

Assumption ATTIV.1b is plausible when z is a binary indicator for eligibility in a 
program, which is randomly determined and does not induce changes in behavior 
other than whether or not to participate. 

To see how Assumption ATTIV.1 identifies $\tau_{att}(x)$, rewrite equation (21.60) as


$$\begin{aligned}
y &= \mu_0 + g_0(x) + w[(\mu_1 - \mu_0) + E(v_1 - v_0 \mid x, w = 1)] \\
&\quad + w[(v_1 - v_0) - E(v_1 - v_0 \mid x, w = 1)] + e_0 \\
&= \mu_0 + g_0(x) + w \cdot \tau_{att}(x) + a + e_0,
\end{aligned} \tag{21.86}$$


where $a = w[(v_1 - v_0) - E(v_1 - v_0 \mid x, w = 1)]$ and $e_0$ is defined in equation (21.66).
Under Assumption ATTIV.1a, $E(e_0 \mid x,z) = 0$. The hard part is dealing with the
term $a$. When $w = 0$, $a = 0$. Therefore, to show that $E(a \mid x, z) = 0$, it suffices to show
that $E(a \mid x,z,w = 1) = 0$. (Remember, $E(a \mid x,z) = P(w = 0 \mid x,z) \cdot E(a \mid x,z,w = 0) +
P(w = 1 \mid x,z) \cdot E(a \mid x,z,w = 1)$.) But this result follows under Assumption ATTIV.1b:


$$E(a \mid x,z,w = 1) = E(v_1 - v_0 \mid x,z,w = 1) - E(v_1 - v_0 \mid x, w = 1) = 0.$$


Now, letting $r = a + e_0$ and assuming that $g_0(x) = \eta_0 + h_0(x)\beta_0$ and
$\tau_{att}(x) = \tau + f(x)\delta$ for some row vectors of functions $h_0(x)$ and $f(x)$, we can write


$$y = \gamma_0 + h_0(x)\beta_0 + \tau w + [w \cdot f(x)]\delta + r, \qquad E(r \mid x,z) = 0,$$


where $\gamma_0 = \mu_0 + \eta_0$. All the parameters of this equation can be consistently estimated by IV, using any
functions of $(x, z)$ as IVs. (These would include $1$, $h_0(x)$, $G(x,z;\hat{\gamma})$, which are the fitted
treatment probabilities, and $G(x,z;\hat{\gamma}) \cdot f(x)$.) The average treatment effect on the
treated for any $x$ is estimated as $\hat{\tau} + f(x)\hat{\delta}$. Averaging over the observations with
$w_i = 1$ gives a consistent estimator of $\tau_{att}$.


21.4.3 Estimating the Local Average Treatment Effect by IV 


We now discuss estimation of an evaluation parameter introduced by Imbens and
Angrist (1994), the local average treatment effect (LATE), in the simplest possible
setting. This requires a slightly more complicated notation. (More general cases
require even more complicated notation, as in AIR.) As before, we let $w$ be the
observed treatment indicator (taking on zero or one), and let the counterfactual
outcomes be $y_1$ with treatment and $y_0$ without treatment. The observed outcome $y$ can
be written as in equation (21.3).

To define $\tau_{late}$, we need to have an instrumental variable, $z$. In the simplest case $z$ is
a binary variable, and we focus attention on that case here. For each unit $i$ in a
random draw from the population, $z_i$ is zero or one. Associated with the two possible
outcomes on $z$ are counterfactual treatments, $w_0$ and $w_1$. These are the treatment
statuses we would observe if $z = 0$ and $z = 1$, respectively. For each unit, we observe
only one of these. For example, $z$ can denote whether a person is eligible for a
particular program, while $w$ denotes actual participation in the program.

Write the observed treatment status as


$$w = (1 - z)w_0 + zw_1 = w_0 + z(w_1 - w_0). \tag{21.87}$$

When we plug this equation into $y = y_0 + w(y_1 - y_0)$ we get

$$y = y_0 + w_0(y_1 - y_0) + z(w_1 - w_0)(y_1 - y_0).$$

A key assumption is


$$z \text{ is independent of } (y_0, y_1, w_0, w_1). \tag{21.88}$$

Under assumption (21.88), all expectations involving functions of $(y_0, y_1, w_0, w_1)$,
conditional on $z$, do not depend on $z$. Therefore,


$$E(y \mid z = 1) = E(y_0) + E[w_0(y_1 - y_0)] + E[(w_1 - w_0)(y_1 - y_0)],$$


$$E(y \mid z = 0) = E(y_0) + E[w_0(y_1 - y_0)].$$

Subtracting the second equation from the first gives


$$E(y \mid z = 1) - E(y \mid z = 0) = E[(w_1 - w_0)(y_1 - y_0)], \tag{21.89}$$


which can be written (see equation (2.49)) as


$$\begin{aligned}
& 1 \cdot E(y_1 - y_0 \mid w_1 - w_0 = 1)P(w_1 - w_0 = 1) \\
& \quad + (-1)E(y_1 - y_0 \mid w_1 - w_0 = -1)P(w_1 - w_0 = -1) \\
& \quad + 0 \cdot E(y_1 - y_0 \mid w_1 - w_0 = 0)P(w_1 - w_0 = 0) \\
&= E(y_1 - y_0 \mid w_1 - w_0 = 1)P(w_1 - w_0 = 1) \\
& \quad - E(y_1 - y_0 \mid w_1 - w_0 = -1)P(w_1 - w_0 = -1).
\end{aligned}$$


To get further, we introduce another important assumption, called monotonicity by
Imbens and Angrist:


$$w_1 \geq w_0. \tag{21.90}$$


In other words, we are ruling out $w_1 = 0$ and $w_0 = 1$. This assumption has a simple
interpretation when $z$ is a dummy variable representing eligibility for treatment:
anyone in the population who would be in the treatment group in the absence of
eligibility would also be in the treatment group if eligible. Units of the
population who do not satisfy monotonicity are called defiers. In many applications,
this assumption seems very reasonable. For example, if $z$ denotes randomly assigned
eligibility in a job training program, assumption (21.90) simply requires that people
who would participate without being eligible would also participate if eligible.

Under assumption (21.90), $P(w_1 - w_0 = -1) = 0$, so assumptions (21.88) and
(21.90) imply


$$E(y \mid z = 1) - E(y \mid z = 0) = E(y_1 - y_0 \mid w_1 - w_0 = 1)P(w_1 - w_0 = 1). \tag{21.91}$$

In this setup, Imbens and Angrist (1994) define $\tau_{late}$ to be

$$\tau_{late} = E(y_1 - y_0 \mid w_1 - w_0 = 1). \tag{21.92}$$

Because $w_1 - w_0 = 1$ is equivalent to $w_1 = 1$, $w_0 = 0$, $\tau_{late}$ has the following
interpretation: it is the average treatment effect for those who would be induced to
participate by changing $z$ from zero to one. There are two things about $\tau_{late}$ that make it
different from the other treatment parameters. First, it depends on the instrument, $z$.
If we use a different instrument, then $\tau_{late}$ generally changes. The parameters $\tau_{ate}$ and
$\tau_{att}$ are defined without reference to an IV, but only with reference to a population.
Second, because we cannot observe both $w_1$ and $w_0$, we cannot identify the
subpopulation with $w_1 - w_0 = 1$. By contrast, $\tau_{ate}$ averages over the entire population,
while $\tau_{att}$ is the average for those who are actually treated.


Example 21.5 (LATE for Attending a Catholic High School): Suppose that $y$ is a
standardized test score, $w$ is an indicator for attending a Catholic high school, and $z$
is an indicator for whether the student is Catholic. Then, generally, $\tau_{late}$ is the mean
effect on test scores for those individuals who choose a Catholic high school because
they are Catholic. Evans and Schwab (1995) use a high school graduation indicator
for $y$, and they estimate a probit model with an endogenous binary explanatory
variable, as described in Section 15.7.3. Under the probit assumptions, it is
possible to estimate $\tau_{ate}$, whereas the simple IV estimator identifies $\tau_{late}$ under weaker
assumptions.


Because $E(y \mid z = 1)$ and $E(y \mid z = 0)$ are easily estimated using a random sample,
$\tau_{late}$ is identified if $P(w_1 - w_0 = 1)$ is estimable and nonzero. Importantly, from the
monotonicity assumption, $w_1 - w_0$ is a binary variable because $P(w_1 - w_0 = -1) = 0$.
Therefore,


$$\begin{aligned}
P(w_1 - w_0 = 1) &= E(w_1 - w_0) = E(w_1) - E(w_0) = E(w \mid z = 1) - E(w \mid z = 0) \\
&= P(w = 1 \mid z = 1) - P(w = 1 \mid z = 0),
\end{aligned}$$


where the second-to-last equality follows from equations (21.87) and (21.88). Each
conditional probability can be consistently estimated given a random sample on
$(w,z)$. Therefore, the final assumption is


$$P(w = 1 \mid z = 1) \neq P(w = 1 \mid z = 0). \tag{21.93}$$

To summarize, under assumptions (21.88), (21.90), and (21.93),

$$\tau_{late} = \frac{E(y \mid z = 1) - E(y \mid z = 0)}{P(w = 1 \mid z = 1) - P(w = 1 \mid z = 0)}. \tag{21.94}$$


Therefore, a consistent estimator is $\hat{\tau}_{late} = (\bar{y}_1 - \bar{y}_0)/(\bar{w}_1 - \bar{w}_0)$, where $\bar{y}_1$ is the
sample average of $y_i$ over that part of the sample where $z_i = 1$ and $\bar{y}_0$ is the sample
average over $z_i = 0$, and similarly for $\bar{w}_1$ and $\bar{w}_0$ (which are sample proportions).
From Problem 5.13b, we know that $\hat{\tau}_{late}$ is identical to the IV estimator of $\tau$ in the
simple equation $y = \delta_0 + \tau w + error$, where $z$ is the IV for $w$.

Our conclusion is that, in the simple case of a binary instrument for the binary
treatment, the usual IV estimator consistently estimates $\tau_{late}$ under weak assumptions.
See Angrist, Imbens, and Rubin (1996) and the discussants’ comments for much 
more, and Imbens and Wooldridge (2009) for a survey of extensions to the basic 
framework. 
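
The estimator based on equation (21.94) is easy to illustrate by simulation. The sketch below is a hypothetical design, not taken from the text: the population consists of always-takers, never-takers, and compliers (no defiers, so monotonicity holds), each group has a different average gain, and $z$ is randomly assigned. The function names and all numerical values are assumptions made for the example.

```python
import random

def simulate(n, seed=0):
    """Hypothetical population with three types and type-specific gains y1 - y0."""
    rng = random.Random(seed)
    data = []
    for _ in range(n):
        u = rng.random()
        if u < 0.2:
            w0, w1, gain = 1, 1, 3.0   # always-taker
        elif u < 0.5:
            w0, w1, gain = 0, 0, 0.5   # never-taker
        else:
            w0, w1, gain = 0, 1, 1.0   # complier
        y0 = rng.gauss(0.0, 1.0)
        z = 1 if rng.random() < 0.5 else 0   # independent of (y0, y1, w0, w1)
        w = w0 + z * (w1 - w0)               # equation (21.87)
        y = y0 + w * gain
        data.append((y, w, z))
    return data

def mean(values):
    values = list(values)
    return sum(values) / len(values)

def wald_late(data):
    """Ratio of the difference in mean outcomes to the difference in treatment rates."""
    y1 = mean(y for y, _, z in data if z == 1)
    y0 = mean(y for y, _, z in data if z == 0)
    w1 = mean(w for _, w, z in data if z == 1)
    w0 = mean(w for _, w, z in data if z == 0)
    return (y1 - y0) / (w1 - w0)

def iv_slope(data):
    """Simple IV estimator: sample Cov(z, y) divided by sample Cov(z, w)."""
    ybar = mean(y for y, _, _ in data)
    wbar = mean(w for _, w, _ in data)
    zbar = mean(z for _, _, z in data)
    czy = sum((z - zbar) * (y - ybar) for y, _, z in data)
    czw = sum((z - zbar) * (w - wbar) for _, w, z in data)
    return czy / czw

data = simulate(50000)
late_hat = wald_late(data)
iv_hat = iv_slope(data)
print("Wald estimate:", late_hat, "IV estimate:", iv_hat)
```

In this design the average gain among compliers is 1.0 while the population ATE is $0.2 \cdot 3.0 + 0.3 \cdot 0.5 + 0.5 \cdot 1.0 = 1.25$, so the estimate should settle near 1.0 rather than 1.25. The two functions return the same number up to rounding, which is the algebraic equivalence noted above.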


21.5 Regression Discontinuity Designs 


We now turn to estimating treatment effects in a special situation where certain 
institutional or logical structures act to determine, or at least affect, treatment 
assignment. Regression discontinuity (RD) designs have a long history. Important 
work in bringing the methods into the mainstream in economics includes van der 
Klaauw (2002) and Hahn, Todd, and van der Klaauw (2001) (or HTV (2001)). 
Imbens and Lemieux (2008) provide a nice overview and survey of the subject, and 
this section draws on this work, as well as Imbens and Wooldridge (2009). 

Generally, RD designs exploit discontinuities in policy assignment. For example, 
there might be an age threshold at which one becomes eligible for pension plan 
vesting, or an income threshold at which one becomes eligible for financial aid. To 
exploit discontinuities that are often determined by rather ad hoc institutional struc- 
tures, one assumes that units just on different sides of the discontinuity are essentially 
the same in unobservables that affect the relevant outcome. The treatment statuses 
of the two groups differ, say, because of the institutional setup, in which case differ- 
ences in outcomes can be attributed to the different treatment statuses. 

We consider the sharp design RD—where assignment follows a deterministic 
rule—and the fuzzy design, where the probability of being treated is discontinuous at 
a known point. 


21.5.1 The Sharp Regression Discontinuity Design 


As before, let $y_0$ and $y_1$ denote the counterfactual outcomes without and with
treatment. For a random draw $i$, these are denoted $y_{i0}$, $y_{i1}$. For now, assume there is a
single covariate, $x_i$, determining treatment (sometimes called the forcing variable). In
the sharp regression discontinuity (SRD) design case, treatment is determined as


$$w_i = 1[x_i \geq c]. \tag{21.95}$$


Along with the forcing variable $x_i$ and treatment status $w_i$, we observe
$y_i = (1 - w_i)y_{i0} + w_i y_{i1}$.
Define, as before, the counterfactual conditional means as $\mu_g(x) = E(y_g \mid x)$,
$g = 0,1$. A critical assumption in RD designs is that these underlying mean functions
are continuous. (Technically, they are continuous at $x = c$, but it is hard to imagine
how we could ensure that they are without assuming continuity over the range of $x$.)
Now, because $w$ is a deterministic function of $x$, ignorability necessarily holds. Stated
in terms of conditional means,


$$E(y_g \mid x, w) = E(y_g \mid x), \qquad g = 0,1,$$


which is exactly Assumption ATE.1'. So, if ignorability holds, why can we not just
apply previous methods, such as propensity score weighting? The key is that with a
sharp RD design, the overlap assumption fails absolutely: by construction, $p(x) = 0$
for all $x < c$ and $p(x) = 1$ for $x \geq c$. Clearly we cannot use a method such as
propensity score weighting.

Technically, we can use regression adjustment, but we would have to rely on
extreme forms of extrapolation in using parametric models. For example, if we
estimate $\tau_{ate}$ using general regression adjustment of the form (21.30), where
$m_1(x, \delta_1) = E(y \mid x, w = 1)$ and $m_0(x, \delta_0) = E(y \mid x, w = 0)$, then, say, for control
observations we must compute $m_1(x_i, \hat{\delta}_1)$ for $x_i < c$, even though no data points with
$x_i < c$ were used in obtaining $\hat{\delta}_1$. If we use local smoothing in a nonparametric setting,
we could not convincingly estimate $m_1(x)$ for $x < c$ or $m_0(x)$ for $x \geq c$.

Rather than trying to estimate $\tau_{ate}$, which relies on strong functional form
assumptions unless we just assume a constant treatment effect, the RD literature
often settles for estimating the average treatment effect at the discontinuity point,
defined as


$$\tau_c = E(y_1 - y_0 \mid x = c) = \mu_1(c) - \mu_0(c). \tag{21.96}$$


This is the average treatment effect for those at the margin of receiving the treatment.
By focusing on $\tau_c$ we are generally sacrificing external validity of the estimated
treatment effect because in different settings the cutoff $c$ may not be particularly
relevant.

A leading reason for focusing on $\tau_c$ is that it is generally identified without any
assumptions other than that the $\mu_g(\cdot)$ functions are continuous at $x = c$. To see why,
write


$$y = (1 - w)y_0 + wy_1 = 1[x < c]y_0 + 1[x \geq c]y_1,$$

and so


$$m(x) = E(y \mid x) = 1[x < c]\mu_0(x) + 1[x \geq c]\mu_1(x). \tag{21.97}$$

Now, using continuity of $\mu_0(\cdot)$ and $\mu_1(\cdot)$ at $c$,


$$m^-(c) = \lim_{x \uparrow c} m(x) = \mu_0(c),$$


$$m^+(c) = \lim_{x \downarrow c} m(x) = \mu_1(c).$$


It follows that

$$\tau_c = m^+(c) - m^-(c). \tag{21.98}$$


Because we can generally estimate $E(y \mid x, w = 0)$ for all $x < c$ and $E(y \mid x, w = 1)$ for
all $x \geq c$, we can estimate the limits of these functions as $x$ approaches $c$ (from the
appropriate direction). As a technical point, we must estimate the two regression
functions at the boundary value, $c$, but several strategies have proved useful.

One such strategy is local linear regression. Define $\mu_{0c} = \mu_0(c)$, $\mu_{1c} = \mu_1(c)$ and
write


$$y_0 = \mu_{0c} + \beta_0(x - c) + u_0 \tag{21.99}$$

$$y_1 = \mu_{1c} + \beta_1(x - c) + u_1 \tag{21.100}$$

so that

$$y = \mu_{0c} + \tau_c w + \beta_0(x - c) + \delta w \cdot (x - c) + r, \tag{21.101}$$


where $\delta = \beta_1 - \beta_0$ and $r = u_0 + w(u_1 - u_0)$. The estimate of $\tau_c$ is just the jump in the
linear function at $x = c$. We could use the entire data set to run the regression


$$y_i \text{ on } 1,\ w_i,\ (x_i - c),\ w_i \cdot (x_i - c) \tag{21.102}$$


and obtain $\hat{\tau}_c$ as the coefficient on $w_i$, but this approach would be global estimation to
estimate a localized average treatment effect, $\tau_c$. To make this a “local” procedure,
choose a “small” value $h > 0$ and only use the data satisfying $c - h < x_i < c + h$. Of
course, this is equivalent to estimating two separate regressions: $y_i$ on $1$, $(x_i - c)$ for
$c - h < x_i < c$ and $y_i$ on $1$, $(x_i - c)$ for $c \leq x_i < c + h$, and then
$\hat{\tau}_c = \hat{\mu}_{1c} - \hat{\mu}_{0c}$ is the difference in the intercepts. For extra flexibility, we can use a
quadratic or cubic in $(x_i - c)$; if a single regression is used, the polynomials should also be
interacted with $w_i$.
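
The two-regression version of the local linear estimator can be sketched in a few lines. The example below simulates a sharp design with a known jump at the cutoff and estimates it as the difference in intercepts from separate regressions of $y_i$ on $(x_i - c)$ on each side, using only observations within a bandwidth $h$ of $c$. The simulated mean functions, the bandwidth value, and the function names are assumptions chosen for illustration; nothing here addresses data-driven bandwidth selection.

```python
import random

def ols_line(xs, ys):
    """Simple OLS of y on (1, x); returns (intercept, slope)."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return my - slope * mx, slope

def srd_local_linear(x, y, c, h):
    """Difference in intercepts from separate regressions on (x - c) within h of c."""
    left = [(xi - c, yi) for xi, yi in zip(x, y) if c - h < xi < c]
    right = [(xi - c, yi) for xi, yi in zip(x, y) if c <= xi < c + h]
    mu0c, _ = ols_line([p[0] for p in left], [p[1] for p in left])
    mu1c, _ = ols_line([p[0] for p in right], [p[1] for p in right])
    return mu1c - mu0c

def simulate(n, c=0.5, tau_c=1.5, seed=0):
    """Hypothetical sharp design: w = 1[x >= c], smooth nonlinear mean functions."""
    rng = random.Random(seed)
    x, y = [], []
    for _ in range(n):
        xi = rng.random()
        wi = 1 if xi >= c else 0
        y0 = 1.0 + 2.0 * xi + xi ** 2 + rng.gauss(0.0, 0.5)
        x.append(xi)
        y.append(y0 + wi * (tau_c + 0.5 * (xi - c)))
    return x, y

x, y = simulate(20000)
tau_c_hat = srd_local_linear(x, y, c=0.5, h=0.1)
print("estimated jump at the cutoff:", tau_c_hat)
```

Because the regressor is centered at $c$, each intercept is the fitted value at the boundary, so no extrapolation step is needed. Rerunning with a larger `h` uses more data but leans harder on the linear approximation to the curved mean functions.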

If $h$ is viewed as a fixed value chosen ahead of time by the researcher, inference
on $\hat{\tau}_c$ is standard: just use a heteroskedasticity-robust standard error. Imbens and
Lemieux (2008) show that if $h$ shrinks to zero quickly enough, the usual inference is
still valid. While one can experiment with different choices of $h$ (trading off more
bias when $h$ is large versus a smaller variance), one can use a data-based method.
Imbens and Kalyanaraman (2009) explicitly look at choosing $h$ to minimize


$$E\{[\hat{\mu}_0(c) - \mu_0(c)]^2 + [\hat{\mu}_1(c) - \mu_1(c)]^2\}, \tag{21.103}$$


a total mean squared error for the two regression functions at the jump point. The
optimal bandwidth choice depends on second derivatives of the regression functions
at $x = c$, the density of $x_i$ at $x = c$, the conditional variances, and the kernel used in
local linear regression. See Imbens and Kalyanaraman (2009) for details.

Adding regressors is no problem: if the regressors are $r_i$, just run the regression $y_i$
on $1, w_i, (x_i - c), w_i \cdot (x_i - c), r_i$, again only using data with $c - h < x_i < c + h$. As
discussed in Imbens and Lemieux (2008), using extra regressors is likely to have more of
an impact when $h$ is large, and it might help reduce the bias arising from the
deterioration of the linear approximation. Another reason for adding $r_i$ is that doing so can
reduce the error variance, possibly improving the precision of $\hat\tau_c$.

For response variables with limited range, we can use local versions of other
estimation methods. For example, suppose the $y_g$ are count variables. Then we
might use the observations with $c - h < x_i < c$ to estimate a Poisson regression
$\mathrm{E}(y \mid x, w = 0) = \exp(\alpha_0 + \beta_0 x)$ and use $c \le x_i < c + h$ to estimate a Poisson
regression $\mathrm{E}(y \mid x, w = 1) = \exp(\alpha_1 + \beta_1 x)$. Of course, we could include more flexible
functions of $x$, too. If the exponential regression functions are correctly specified for $x$
"near" $c$, we can estimate $\tau_c$ as

$$\hat\tau_c = \exp(\hat\alpha_1 + \hat\beta_1 c) - \exp(\hat\alpha_0 + \hat\beta_0 c). \tag{21.104}$$
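
As an illustration of the local Poisson approach, here is a minimal sketch under stated assumptions: the helper names `poisson_qmle` and `srd_poisson` are hypothetical, the Poisson quasi-MLE is computed by a plain Newton iteration, and the regressor is centered at $c$ so that the exponentiated intercept equals the fitted mean $\exp(\hat\alpha_g + \hat\beta_g c)$ at the cutoff.

```python
import numpy as np

def poisson_qmle(X, y, max_iter=100, tol=1e-10):
    """Poisson quasi-MLE for E(y|X) = exp(X b) by Newton's method.

    Assumes the first column of X is a constant and that mean(y) > 0.
    """
    b = np.zeros(X.shape[1])
    b[0] = np.log(y.mean())                    # start at the constant-only fit
    for _ in range(max_iter):
        mu = np.exp(X @ b)
        score = X.T @ (y - mu)
        hessian = X.T @ (X * mu[:, None])      # negative Hessian of the log likelihood
        step = np.linalg.solve(hessian, score)
        b = b + step
        if np.max(np.abs(step)) < tol:
            break
    return b

def srd_poisson(y, x, c, h):
    """Difference in fitted exponential means at the cutoff, as in (21.104)."""
    y, x = np.asarray(y, dtype=float), np.asarray(x, dtype=float)

    def mean_at_cutoff(mask):
        # Regressing on (x - c) makes exp(intercept) the fitted mean at x = c.
        X = np.column_stack([np.ones(mask.sum()), x[mask] - c])
        return np.exp(poisson_qmle(X, y[mask])[0])

    left = (x > c - h) & (x < c)
    right = (x >= c) & (x < c + h)
    return mean_at_cutoff(right) - mean_at_cutoff(left)
```

Centering at $c$ is only a reparameterization of $\exp(\alpha_g + \beta_g x)$, so the fitted value at the cutoff is unchanged. No step-size safeguards are included, so the Newton loop is a sketch rather than a robust optimizer.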


21.5.2 The Fuzzy Regression Discontinuity Design


In the fuzzy regression discontinuity (FRD) design, the probability of treatment
changes discontinuously at $x = c$. Define the propensity score as a function of the
scalar $x$ as

$$P(w = 1 \mid x) \equiv F(x). \tag{21.105}$$

As in the SRD case, we still assume that the counterfactual conditional mean
functions $\mu_0(\cdot)$ and $\mu_1(\cdot)$ are continuous at $c$. The key assumption for the FRD design is
that $F(\cdot)$ is discontinuous at $c$, so that there is a discrete jump in the probability of
treatment at the cutoff.

There are various ways to identify $\tau_c$. An assumption that leads to a fairly
straightforward analysis is that the individual-specific gain, $y_1 - y_0$, is independent of
$w$, conditional on $x$. Therefore, treatment $w$ is allowed to be correlated with $y_0$ (after
conditioning on $x$) but not with the unobserved gain from treatment. Compared with
identifying other average treatment effects, estimating $\tau_{att}$ requires (some version of)
ignorability of $w$ with respect to $y_0$ while $\tau_{ate}$ requires that $w$ is ignorable with respect
to $(y_0, y_1)$ (which clearly implies the other two assumptions).

To see how $\tau_c$ is identified, again write $y = y_0 + w(y_1 - y_0)$ and use conditional
independence between $w$ and $(y_1 - y_0)$:

$$\mathrm{E}(y \mid x) = \mathrm{E}(y_0 \mid x) + \mathrm{E}(w \mid x)\,\mathrm{E}(y_1 - y_0 \mid x) = \mu_0(x) + \mathrm{E}(w \mid x) \cdot \tau(x).$$

As in the SRD case, take limits from the right and left and use continuity of $\mu_0(\cdot)$ and
$\tau(\cdot)$ at $c$:

$$m^+(c) = \mu_0(c) + F^+(c)\tau_c$$
$$m^-(c) = \mu_0(c) + F^-(c)\tau_c.$$

It follows that, if $F^+(c) \neq F^-(c)$, then

$$\tau_c = \frac{m^+(c) - m^-(c)}{F^+(c) - F^-(c)}. \tag{21.106}$$


Because the mean and propensity score are generally identified in a neighborhood of
$c$, $\tau_c$ is generally identified.

Given consistent estimators of the four quantities in equation (21.106), we have

$$\hat\tau_c = \frac{\hat m^+(c) - \hat m^-(c)}{\hat F^+(c) - \hat F^-(c)}. \tag{21.107}$$

Imbens and Lemieux (2008) suggest estimating $m^+(c)$, $m^-(c)$, $F^+(c)$, and $F^-(c)$
all by local linear regression. Namely, $\hat m^+(c) = \hat\alpha_{1c}$, $\hat m^-(c) = \hat\alpha_{0c}$, $\hat F^+(c) = \hat\theta_{1c}$, and
$\hat F^-(c) = \hat\theta_{0c}$ are the intercepts from four local linear regressions. For example, $\hat\alpha_{1c}$ is
from $y_i$ on $1, (x_i - c)$, $c \le x_i < c + h$ and $\hat\theta_{1c}$ is from $w_i$ on $1, (x_i - c)$, $c \le x_i < c + h$.

As a computational device, and also for simple inference, HTV (2001) show
that equation (21.107) is numerically identical to the local IV estimator of $\tau_c$ in the
equation

$$y_i = \mu_c + \tau_c w_i + \beta_0 (x_i - c) + \delta 1[x_i \ge c] \cdot (x_i - c) + e_i, \tag{21.108}$$

where $z_i = 1[x_i \ge c]$ is the IV for $w_i$. One uses the data such that $c - h < x_i < c + h$.
If $h$ is fixed or is decreasing at a rate described in Imbens and Lemieux (2008), one
can use the usual heteroskedasticity-robust IV standard error for $\hat\tau_c$.
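
The numerical identity between the ratio in (21.107) and the local IV estimator can be illustrated as follows. This is a sketch under stated assumptions rather than a definitive implementation: the function names `frd_wald` and `frd_local_iv` are hypothetical, the same bandwidth $h$ and a uniform kernel are used for both the outcome and the treatment, and the instrument is taken to be $z_i = 1[x_i \ge c]$.

```python
import numpy as np

def frd_wald(y, w, x, c, h):
    """Ratio of the jump in E(y|x) to the jump in P(w=1|x), each jump
    estimated from one-sided local linear regressions."""
    y, w, x = (np.asarray(a, dtype=float) for a in (y, w, x))

    def intercept(v, mask):
        X = np.column_stack([np.ones(mask.sum()), x[mask] - c])
        return np.linalg.lstsq(X, v[mask], rcond=None)[0][0]

    left = (x > c - h) & (x < c)
    right = (x >= c) & (x < c + h)
    jump_y = intercept(y, right) - intercept(y, left)
    jump_w = intercept(w, right) - intercept(w, left)
    return jump_y / jump_w

def frd_local_iv(y, w, x, c, h):
    """Just-identified IV estimate of the coefficient on w_i in (21.108)."""
    y, w, x = (np.asarray(a, dtype=float) for a in (y, w, x))
    keep = (x > c - h) & (x < c + h)
    yk, wk, xc = y[keep], w[keep], x[keep] - c
    z = (xc >= 0).astype(float)                              # instrument for w_i
    X = np.column_stack([np.ones_like(xc), wk, xc, z * xc])  # regressors
    Z = np.column_stack([np.ones_like(xc), z, xc, z * xc])   # instruments
    beta = np.linalg.solve(Z.T @ X, Z.T @ yk)                # (Z'X)^{-1} Z'y
    return beta[1]
```

Because the model is just identified, the IV coefficient on $w_i$ equals the reduced-form coefficient on $z_i$ divided by the first-stage coefficient on $z_i$, which is exactly the ratio of the two estimated jumps.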

Rather than using a local linear model for the probability of treatment, we could,
say, use a local logit model, for example,

$$P(w = 1 \mid x) = \Lambda(\gamma_0 + \psi_0 (x - c)), \quad x < c$$
$$P(w = 1 \mid x) = \Lambda(\gamma_1 + \psi_1 (x - c)), \quad x \ge c$$

and then use

$$\hat F^+(c) - \hat F^-(c) = \Lambda(\hat\gamma_1) - \Lambda(\hat\gamma_0),$$

where $(\hat\gamma_0, \hat\psi_0)$ are from a logit of $w_i$ on $1, (x_i - c)$ using $c - h < x_i < c$, and similarly
for $(\hat\gamma_1, \hat\psi_1)$.

In the FRD case, we have to choose two bandwidths (even assuming that we use
symmetric bandwidths in estimating the regression functions on either side of the
jump). We could use the same bandwidth for $P(w = 1 \mid x)$ and $\mathrm{E}(y \mid x)$ or choose
them separately using, say, Imbens and Kalyanaraman (2009).


21.5.3 Unconfoundedness versus the Fuzzy Regression Discontinuity 


Unlike in the SRD case, it is possible that overlap can hold for the FRD (although 
it might be weak in practice). Therefore, we can compare regression adjustment to 
estimators that exploit the FRD. 

It is useful to return to the linear formulation:

$$y = \mu_{0c} + \tau_c w + \beta_0 (x - c) + \delta w \cdot (x - c) + u_0 + w(u_1 - u_0). \tag{21.109}$$

Under the ignorability assumption $D(y_0, y_1 \mid w, x) = D(y_0, y_1 \mid x)$, the composite
error has zero mean conditional on $(w, x)$, and so OLS (or local regression)
consistently estimates $\tau_c$. In fact, if we believe unconfoundedness and the linear
functional form, we can use all the data and average across $x_i$ to estimate $\tau_{ate}$.

If we only assume $D(y_1 - y_0 \mid w, x) = D(y_1 - y_0 \mid x)$, but allow $u_0$ to be
correlated with $w$, then the OLS estimator is inconsistent. Recall that the estimate of $\tau_c$
can be written as

$$\hat\tau_c = \hat m_1(c) - \hat m_0(c),$$

where $\hat m_1(x)$ is estimated using the $w_i = 1$ observations and $\hat m_0(x)$ is estimated using
the $w_i = 0$ observations. In other words, the discontinuity in $P(w = 1 \mid x)$ at $x = c$ is
essentially ignored. By contrast, the estimator in equation (21.107) is consistent under
the weaker version of ignorability, and it directly exploits the jumps in the mean
responses and treatment probabilities at $x = c$.

A further benefit of the estimator in equation (21.107) is that it is consistent for
ATE for compliers at $x = c$ without unconfoundedness, provided we add a
monotonicity assumption. See Imbens and Lemieux (2008) for a detailed treatment.


21.6 Further Issues


We now discuss some special considerations and extensions of the basic methods
previously discussed.


21.6.1 Special Considerations for Responses with Discreteness or Limited Range 


We have seen that, when ignorability of selection holds, several methods of estimat- 
ing average treatment effects are available that depend in no essential way on the 
nature of y. Propensity score weighting and matching can be applied directly whether 
the response is continuous or discrete or has both features. Methods that rely on 
regression adjustment (parametric or even series estimation)—either by itself or in 
conjunction with other methods—work better when the chosen conditional mean 
functions are good approximations to E(y| w = 1,x) and E(y |w = 0,x). We already 
discussed how binary and fractional responses can be modeled using logistic and 
related functional forms (probably combined with the Bernoulli log likelihood) 
and how exponential regression functions can be used when y is nonnegative 
(possibly combined with the Poisson log likelihood). But if y has both discrete and 
continuous characteristics, we might want to try other models for the conditional 
expectations. For example, suppose $y_0$ and $y_1$ are corner solutions. Under Assumption
ATE.1, $D(y_g \mid w, x) = D(y_g \mid x)$, $g = 0, 1$. Therefore, we can estimate models for
$D(y \mid w = 0, x)$ and $D(y \mid w = 1, x)$ that account for the corner nature of $y$. This could
be the standard Type I Tobit model if the corner is at zero, or a two-limit Tobit 
model as in Section 17.7 if there are two corners. But we might also try a 
hurdle model separately for w = 0 and w = 1. Remember, the idea is to eventually 
obtain good estimates of the conditional means, and a Tobit or hurdle model might 
do that better than, say, a linear model, even though the Tobit or hurdle model is 
itself misspecified. Once we have those estimates of the mean functions, we can use 
equation (21.30), as always. 

When we relaxed ignorability and allowed treatment to be correlated with unob- 
servables even after conditioning on x, the IV, correction function, and control 
function methods discussed in Section 21.4 relied heavily on linear (in parameters and 
unobservables) functional forms. Even though the linear functional forms can, under 
certain assumptions, consistently estimate a local average treatment effect, we may
want to use nonlinear functions and try to better approximate $\tau_{ate}$ or $\tau_{att}$. If the
counterfactual responses are binary, we might specify

$$y_0 = 1[\alpha_0 + x\beta_0 + e_0 \ge 0] \tag{21.110}$$
$$y_1 = 1[\alpha_1 + x\beta_1 + e_1 \ge 0], \tag{21.111}$$

where $(e_0, e_1)$ is independent of $(x, z)$ and each has a standard normal distribution. If
we add, as in the linear case,

$$w = 1[\theta_0 + x\theta_1 + z\theta_2 + a > 0], \tag{21.112}$$

where $(e_0, e_1, a)$ is independent of $(x, z)$ and trivariate normal, then we can estimate
$(\alpha_0, \beta_0')'$ and $(\alpha_1, \beta_1')'$ by "selection" probits on the $w_i = 0$ and $w_i = 1$ subsamples,
respectively; see Section 19.6.3. Then $\hat\tau_{ate}(x) = \Phi(\hat\alpha_1 + x\hat\beta_1) - \Phi(\hat\alpha_0 + x\hat\beta_0)$ and

$$\hat\tau_{ate} = N^{-1} \sum_{i=1}^{N} [\Phi(\hat\alpha_1 + x_i\hat\beta_1) - \Phi(\hat\alpha_0 + x_i\hat\beta_0)]. \tag{21.113}$$

If we impose $\beta_1 = \beta_0$ and $e_1 = e_0$, we obtain the bivariate probit model that we
discussed in Section 15.7.3. Naturally, it is better to use the more flexible model unless
evidence suggests it is not needed.

If the $y_g$ are corner solutions, we might use the specification $y_g = \max(0, \alpha_g +
x\beta_g + e_g)$ and make the standard assumptions so that $y_g$ follows a Tobit. In fact, we
could still assume that $(e_0, e_1, a)$ is independent of $(x, z)$ and trivariate normal, but
now where $\sigma_g^2 = \mathrm{Var}(e_g)$ are variance parameters to be estimated. Although Tobit
models with "endogenous switching" are not common, there is no reason they cannot
be useful for estimating ATEs with corner solution responses. Statistically, estimation
of the parameters for the control and treatment groups is the same as estimating a
Tobit model with endogenous sample selection. (The case with $\beta_1 = \beta_0$ and $e_1 = e_0$ is
covered in Problem 17.6.) The estimate of $\tau_{ate}$ takes the usual form

$$\hat\tau_{ate} = N^{-1} \sum_{i=1}^{N} [m(\hat\alpha_1 + x_i\hat\beta_1, \hat\sigma_1^2) - m(\hat\alpha_0 + x_i\hat\beta_0, \hat\sigma_0^2)], \tag{21.114}$$

where $m(\cdot, \cdot)$ is the conditional mean function for a Tobit model. Hurdle models
could be used, too, but estimation becomes even more complicated.

For exponential response functions we can use the "selection correction" described
in Section 19.6.4 on the control and treated samples, and then, as usual, construct an
average of the difference in estimated counterfactual means. These can be applied
under the assumption $\mathrm{E}(y_g \mid a_g, w, x, z) = \exp(a_g + x\beta_g)$ with $a_g$ independent of $(x, z)$,
$g = 0, 1$, and suitable normality assumptions (at a minimum in the probit model for
$w$). A similar approach can be used for fractional responses when just the
counterfactual conditional means are assumed to be correctly specified.


21.6.2 Multivalued Treatments 


So far, we have considered the case with a single treatment level, which we indicate
as $w_i = 1$ with $w_i = 0$ indicating a control unit. In some cases, program participation
can take place at different levels (say, part-time or full-time) or there could be
different options for treatment, such as different job training programs. Both cases
require an extension of the previous framework and estimation method.

Now suppose that the treatment variable can take on $G + 1$ different values, which
we label $\{0, 1, 2, \ldots, G\}$. Typically, zero indicates the control group and $1, \ldots, G$
different levels or options for treatment. Thus, $w_i$ takes a value in $\{0, 1, 2, \ldots, G\}$.
Now, of course, there are $G + 1$ counterfactual outcomes, which we denote, for a
random draw $i$, $\{y_{ig} : g = 0, 1, \ldots, G\}$. The observed response, $y_i$, can be expressed
as

$$y_i = 1[w_i = 0]\, y_{i0} + 1[w_i = 1]\, y_{i1} + \cdots + 1[w_i = G]\, y_{iG}. \tag{21.115}$$

As usual, we have available a set of covariates, $x_i$. Define $\mu_g = \mathrm{E}(y_{ig})$ as the
population means of the counterfactuals. A sufficient ignorability condition for identifying the
means is the conditional mean independence assumption

$$\mathrm{E}(y_{ig} \mid w_i, x_i) = \mathrm{E}(y_{ig} \mid x_i), \quad g = 0, 1, \ldots, G. \tag{21.116}$$

Under assumption (21.116) it follows easily that

$$\mathrm{E}(y_i \mid w_i, x_i) = 1[w_i = 0]\,\mathrm{E}(y_{i0} \mid x_i) + 1[w_i = 1]\,\mathrm{E}(y_{i1} \mid x_i) + \cdots + 1[w_i = G]\,\mathrm{E}(y_{iG} \mid x_i), \tag{21.117}$$

which immediately shows that the mean function $\mathrm{E}(y_g \mid x)$ is identified because

$$\mathrm{E}(y_g \mid x) = \mathrm{E}(y \mid w = g, x). \tag{21.118}$$


We can estimate $\mathrm{E}(y \mid w = g, x)$ for each $g$, given a random sample, by restricting
attention to units with $w_i = g$. In other words, regression adjustment in the
multiple-treatment case is an obvious extension of the case when $w_i$ is binary. The same
considerations apply here as in the binary case, including using particular functional forms
that account for the nature of $y$ and using nonparametric regression.

Given conditional mean estimates $\{\hat m_g(x) : g = 0, 1, \ldots, G\}$, we can estimate the
average treatment effect for treatment level $h$ relative to $g$, say $\tau_{gh}$, as

$$\hat\tau_{gh,reg} = N^{-1} \sum_{i=1}^{N} [\hat m_h(x_i) - \hat m_g(x_i)]. \tag{21.119}$$

Or, if $\alpha_{gh}$ is the average treatment effect for those in either group $g$ or $h$, $\hat\alpha_{gh,reg}$
is obtained by averaging the differences $\hat m_h(x_i) - \hat m_g(x_i)$ across the subsample with
$w_i = g$ or $w_i = h$. For example, perhaps $g$ is a lower level of participation in job
training than $h$, and we want to estimate the average treatment effect of going from
part-time to full-time for those who were in one of those treated groups. We can
define a measure more like the average treatment effect on the treated as $\eta_{gh} = \mathrm{E}(y_h - y_g \mid w = h)$,
which would be particularly interesting when $g = 0$ is the control group. Now we
would simply average $\hat m_h(x_i) - \hat m_g(x_i)$ across the subsample with $w_i = h$.
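
The regression-adjustment calculations just described can be sketched in a few lines. The sketch below makes assumptions not in the text: the conditional means are taken to be linear in the covariates within each treatment level, estimation is by OLS, and the function name `ra_effects` is hypothetical.

```python
import numpy as np

def ra_effects(y, w, X, g, h):
    """Regression-adjustment estimates for level h relative to level g.

    Returns (average over the full sample, average over units with w_i = h).
    Conditional means are assumed linear in X within each treatment level.
    """
    y = np.asarray(y, dtype=float)
    w = np.asarray(w)
    Xc = np.column_stack([np.ones(len(y)), np.asarray(X, dtype=float)])

    def fitted(level):
        m = w == level                                  # use only units with w_i = level
        b = np.linalg.lstsq(Xc[m], y[m], rcond=None)[0]
        return Xc @ b                                   # predict for every unit

    diff = fitted(h) - fitted(g)
    return diff.mean(), diff[w == h].mean()
```

The first returned value corresponds to averaging over the whole sample as in (21.119); the second averages only over units observed at level $h$. Other functional forms (for example, exponential or logistic means) could replace the linear fit without changing the averaging step.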

It is pretty clear that overlap is needed to estimate average treatment effects with
multiple treatment levels. While we can get by with less for some cases, the most
straightforward statement is based on the set of propensity scores (which Imbens,
2000, calls the generalized propensity score):

$$P(w = g \mid x) \equiv p_g(x) > 0, \quad x \in \mathcal{X}, \; g = 0, \ldots, G. \tag{21.120}$$

Because the propensity scores must sum to unity, assumption (21.120) rules out the
case of unit probabilities for any $g$.

Propensity score weighting can also be used under the previous ignorability and
overlap assumptions. For example, using the same argument as in Section 21.3.3, it is
easily shown that

$$\mathrm{E}\left[\frac{1[w_i = g]\, y_i}{p_g(x_i)}\right] = \mu_g, \quad g = 0, 1, \ldots, G, \tag{21.121}$$

and so consistent estimates of the counterfactual means take the form

$$\hat\mu_{g,ps} = N^{-1} \sum_{i=1}^{N} \frac{1[w_i = g]\, y_i}{\hat p_g(x_i)}, \tag{21.122}$$

where $\hat p_g(\cdot)$ are the estimated propensity scores. Given these estimates, we can form
differences such as $\hat\tau_{gh,ps} = \hat\mu_{h,ps} - \hat\mu_{g,ps}$. Estimating the other kinds of treatment
effects mentioned previously requires more care, but is fairly straightforward. See, for
example, Imbens (2000) and Lechner (2001).
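
A minimal sketch of the weighting estimator in (21.122) follows. It assumes the estimated generalized propensity scores are already available as an array (how they are obtained is left open), and the function name `ipw_means` is hypothetical.

```python
import numpy as np

def ipw_means(y, w, pscores):
    """IPW estimates of the counterfactual means mu_g, g = 0, ..., G.

    pscores is an N x (G+1) array whose column g holds the estimated
    P(w = g | x_i); every entry is assumed to be strictly positive.
    """
    y = np.asarray(y, dtype=float)
    w = np.asarray(w)
    p = np.asarray(pscores, dtype=float)
    levels = np.arange(p.shape[1])
    indicators = (w[:, None] == levels[None, :]).astype(float)   # 1[w_i = g]
    return (indicators * y[:, None] / p).mean(axis=0)
```

Differences of the returned elements, such as `mu[h] - mu[g]`, give the weighting estimate of the effect of level $h$ relative to level $g$. With constant propensity scores equal to the sample frequencies, each element reduces to the subsample mean of $y_i$ for that treatment level.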

To implement IPW estimation, we have to estimate the propensity scores. A
common parametric approach would be to use a multinomial logit (MNL) model with
flexible functions in $x_i$. Because we are just looking for good estimates of the
probabilities, the MNL model will often work well, but other approaches, such as nested
logit models, can be used. If the treatment categories are obviously ordered (say, $w_i$
is number of years in college for the population of high school graduates) one might
try an ordered model, such as ordered logit.

In addition to regression adjustment and propensity score weighting, one can, of
course, use matching, either on the covariates or propensity scores. See, for
example, Lechner (2001).


21.6.3 Multiple Treatments


The case of multiple treatments is more difficult to handle, and in some cases relies
on additional functional form restrictions. Of course, if $w_i$ is an $M$-vector of
treatment variables and the $x_i$ are the controls, we can always work with models for
$\mathrm{E}(y_i \mid w_i, x_i)$ directly, but then we are circumventing the counterfactual framework.

A fairly general framework, which actually encompasses the framework of the
previous section, is to use a random coefficient setting. Let $w_i$ be $1 \times M$ and $c_i$ an
$M \times 1$ vector of unobserved heterogeneity. Assume that

$$y_i = w_i c_i = w_{i1} c_{i1} + \cdots + w_{iM} c_{iM}. \tag{21.123}$$

By choosing the $c_{im}$ as counterfactual outcomes and the $w_{im}$ as treatment indicators
for each treatment level (including the control), we can put the multivalued treatment
framework into the current setting. But we can also allow for truly different
treatments. For example, suppose we have two programs, A and B, where participation is
allowed in one or both. Then we can define $w_i$ to have four dummy variables
indicating every possible treatment state: participation in neither, participation in A but
not B, participation in B but not A, and participation in both. Or, some of the $w_{im}$
can be continuous treatments along with discrete treatments. Further, as described in
Wooldridge (2004), the $w_{im}$ can be different functions of the same (nonbinary)
treatment in order to make functional forms more flexible. If we want, we can set, say,
$w_{i1} \equiv 1$ so that there is an intercept in the equation; in other cases it is more
convenient to have a full set of treatment indicators and exclude an intercept.

Given covariates $x_i$, the goal is to estimate $\mathrm{E}(c_i \mid x_i = x)$ and then, eventually,
$\mu_c = \mathrm{E}(c_i)$. When we use appropriate linear combinations, the latter corresponds to
average treatment effects across the entire population while $\mathrm{E}(c_i \mid x_i = x)$ generally
corresponds to conditional average treatment effects.

If we assume ignorability, we can consistently estimate $\mathrm{E}(c_i \mid x_i = x)$, and,
assuming overlap, we can then average to estimate $\mu_c$. The key assumptions in the
population are

$$\mathrm{E}(y \mid w, c, x) = \mathrm{E}(y \mid w, c) \tag{21.124}$$

and

$$\mathrm{E}(w'w \mid c, x) = \mathrm{E}(w'w \mid x). \tag{21.125}$$

In many cases, we would obtain assumption (21.125) from $\mathrm{E}(w \mid c, x) = \mathrm{E}(w \mid x)$ and
$\mathrm{Var}(w \mid c, x) = \mathrm{Var}(w \mid x)$; in other words, from ignorability conditional on the first
two moments of the distribution of $w$ given $(c, x)$.

If we also assume that $A(x) \equiv \mathrm{E}(w'w \mid x)$ is nonsingular, we can easily find an
expression for $\mathrm{E}(c \mid x)$. Write $y = wc + u$ where $\mathrm{E}(u \mid w, c, x) = 0$ by assumption
(21.124). Then $w'y = w'wc + w'u$, and so

$$\mathrm{E}(w'y \mid x) = \mathrm{E}(w'wc \mid x) + \mathrm{E}(w'u \mid x) = \mathrm{E}(w'wc \mid x).$$

Now, by iterated expectations, $\mathrm{E}(w'wc \mid x) = \mathrm{E}[\mathrm{E}(w'wc \mid c, x) \mid x]$, and

$$\mathrm{E}[\mathrm{E}(w'wc \mid c, x) \mid x] = \mathrm{E}[\mathrm{E}(w'w \mid c, x)c \mid x] = \mathrm{E}[\mathrm{E}(w'w \mid x)c \mid x] = \mathrm{E}(w'w \mid x)\mathrm{E}(c \mid x),$$

where the second equality follows from assumption (21.125). We have shown that

$$\mathrm{E}(w'y \mid x) = \mathrm{E}(w'w \mid x)\mathrm{E}(c \mid x),$$

and so, assuming invertibility of $\mathrm{E}(w'w \mid x)$,

$$\mathrm{E}(c \mid x) = [\mathrm{E}(w'w \mid x)]^{-1}\mathrm{E}(w'y \mid x) = [A(x)]^{-1}\mathrm{E}(w'y \mid x). \tag{21.126}$$


Because we can collect a random sample on $(y_i, w_i, x_i)$, we can estimate all the
conditional moments that appear in $\mathrm{E}(c \mid x)$.

For estimating the unconditional mean $\mu_c$ it suffices to estimate $A(x)$ because

$$\mu_c = \mathrm{E}\{[A(x)]^{-1}\mathrm{E}(w'y \mid x)\} = \mathrm{E}\{[A(x)]^{-1}w'y\}, \tag{21.127}$$

which just uses the fact that $w'y = \mathrm{E}(w'y \mid x) + r$ with $\mathrm{E}(r \mid x) = 0$.

Expression (21.127) takes on a simple, recognizable form when $w$ is a vector of
mutually exclusive, exhaustive binary indicators for different treatment groups. Then
$w_i'w_i = \mathrm{diag}(w_{i1}, \ldots, w_{iM})$ and so $A(x)$ is the $M \times M$ diagonal matrix with
$\mathrm{E}(w_m \mid x) = P(w_m = 1 \mid x)$, that is, the propensity scores, down the diagonal. It is easily seen
that the $m$th element of $\mathrm{E}\{[A(x_i)]^{-1}w_i'y_i\}$ is simply

$$\mathrm{E}[w_{im}\, y_i / p_m(x_i)],$$

which is exactly the expression we derived in the previous subsection to motivate
propensity score weighting in the multiple treatment case.

More generally, $A(x)$ might not be a diagonal matrix, although it always will be if
we choose $w_i$ to saturate all possible treatment scenarios. In other words, if we have
$Q$ possible programs where, for program $q$, there are $J_q$ treatment levels, then $w_i$ has
$\prod_{q=1}^{Q} J_q$ elements. (If certain combinations are impossible, they can just be excluded.)
With continuous treatments, or where we, say, let each program have its own set
of treatment effects and these do not interact with treatment effects from other
programs, then we generally have to estimate conditional covariances along with
conditional means (and variances). Assuming we have a consistent estimator $\hat A(x)$ of $A(x)$
for each $x$, consistent estimators of $\mu_c$ have the form

$$\hat\mu_c = N^{-1} \sum_{i=1}^{N} [\hat A(x_i)]^{-1} w_i' y_i. \tag{21.128}$$
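
The sample average in (21.128) is straightforward to compute once estimates $\hat A(x_i)$ are in hand. The sketch below assumes those estimates are supplied as a stacked array (their estimation is not addressed here), and the function name `mu_c_hat` is hypothetical.

```python
import numpy as np

def mu_c_hat(y, W, A_hat):
    """Sample analogue of E{[A(x)]^{-1} w'y}.

    y: (N,), W: (N, M) with row i equal to w_i, A_hat: (N, M, M) with
    A_hat[i] an estimate of E(w'w | x_i). Each A_hat[i] must be nonsingular.
    """
    y = np.asarray(y, dtype=float)
    W = np.asarray(W, dtype=float)
    A_hat = np.asarray(A_hat, dtype=float)
    rhs = W * y[:, None]                                   # row i holds w_i' y_i
    # Batched solve of A_hat[i] @ t_i = w_i' y_i for every unit i.
    terms = np.linalg.solve(A_hat, rhs[:, :, None])[:, :, 0]
    return terms.mean(axis=0)
```

When the rows of `W` are mutually exclusive, exhaustive indicators and each `A_hat[i]` is the diagonal matrix of estimated propensity scores, this reproduces the propensity score weighting estimates of the counterfactual means.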


In many cases it is useful to explicitly separate the random slopes from the random
intercept, in which case we can write in the population

$$\mathrm{E}(y \mid w, c) = \mathrm{E}(y \mid w, c, x) = a + wb,$$

where $w$ is now a set of $K$ treatment variables such that $\Omega(x) \equiv \mathrm{Var}(w \mid x)$ is
nonsingular for all (relevant) values of $x$. Then, as can be derived from equation (21.127)
using partitioned inverse (see also Wooldridge (2004)),

$$\mathrm{E}(b \mid x) = [\mathrm{Var}(w \mid x)]^{-1}\mathrm{Cov}(w, y \mid x) = [\Omega(x)]^{-1}\mathrm{Cov}(w, y \mid x)$$

and

$$\mu_b = \mathrm{E}(b) = \mathrm{E}\{[\Omega(x)]^{-1}[w - \psi(x)]'y\},$$

where $\psi(x) = \mathrm{E}(w \mid x)$ is the $1 \times K$ vector of conditional mean functions. In this
formulation, we see directly that the conditional mean and conditional
variance-covariance matrix of the treatment $w$ are needed to estimate the average slopes, $\mu_b$.
Because we assume a random sample on $(w, x)$, these moments are generally
identified. In practice, we might use parametric models that account for the nature of the
elements of $w$. Naturally, given such estimators, we estimate $\mu_b$ as a sample average,
$\hat\mu_b = N^{-1} \sum_{i=1}^{N} [\hat\Omega(x_i)]^{-1}[w_i - \hat\psi(x_i)]'y_i$.

When treatment is not ignorable for a set of covariates $x$, we need to obtain
instrumental variables. Linear IV estimation is relatively straightforward provided
unobserved heterogeneity is additive, as in Section 21.4.1. But allowing for general
heterogeneous treatment effects without ignorability is difficult. Control function
approaches would rely on

$$\mathrm{E}(y \mid w, x, z) = w\,\mathrm{E}(c \mid w, x, z),$$

where $z$ is the vector of instruments, and the latter can be difficult to obtain when $w$
is a vector. Wooldridge (2008) considered correction function approaches under the
assumptions

$$\mathrm{E}(y \mid w, c, x, z) = \mathrm{E}(y \mid w, c) \tag{21.129}$$

(which is not much of an assumption because $c$ can include lots of factors) and

$$\mathrm{E}(c \mid x, z) = \mathrm{E}(c \mid x) = \mu_c + (x - \mu_x)\Gamma, \tag{21.130}$$

which is a standard exclusion restriction on the instruments, $z$, along with a linearity
assumption. If we write $c_m = \mu_{c_m} + (x - \mu_x)\gamma_m + v_m$, then we have

$$\begin{aligned}
y_i = {} & \mu_{c_1} w_{i1} + \mu_{c_2} w_{i2} + \cdots + \mu_{c_M} w_{iM} + w_{i1}(x_i - \mu_x)\gamma_1 + \cdots + w_{iM}(x_i - \mu_x)\gamma_M \\
& + w_{i1} v_{i1} + \cdots + w_{iM} v_{iM}.
\end{aligned}$$

The correction function approach now entails finding $\mathrm{E}(w_{im} v_{im} \mid x_i, z_i)$ and inserting
these for the $w_{im} v_{im}$, and then applying instrumental variables using functions of
$(x_i, z_i)$. More specifically, suppose $w_{im} = f_m(x_i, z_i, u_{im}, \alpha_m)$ and $\mathrm{E}(v_{im} \mid u_{im}, x_i, z_i) =
\rho_m u_{im}$. Then, with distributional assumptions on $u_{im}$, we can estimate $\alpha_m$ and
parameters in the distribution of $u_{im}$ to obtain $\mathrm{E}(w_{im} v_{im} \mid x_i, z_i) = \rho_m h_m(x_i, z_i, \theta_m)$, where the $\theta_m$
are assumed to be identified and $h_m(\cdot)$ is known. (Section 21.4.2 considered the case
where $w_i$ is binary and follows a probit; the correction function was the standard
normal pdf evaluated at a linear index.) Now we have the estimating equation

$$\begin{aligned}
y_i = {} & \mu_{c_1} w_{i1} + \mu_{c_2} w_{i2} + \cdots + \mu_{c_M} w_{iM} + w_{i1}(x_i - \mu_x)\gamma_1 + \cdots + w_{iM}(x_i - \mu_x)\gamma_M \\
& + \rho_1 h_1(x_i, z_i, \theta_1) + \cdots + \rho_M h_M(x_i, z_i, \theta_M) + r_i
\end{aligned} \tag{21.131}$$

$$\mathrm{E}(r_i \mid x_i, z_i) = 0.$$


After replacing $\mu_x$ with the sample average $\bar x$, plugging in the $\hat\theta_m$, and specifying the
instruments, we can estimate the equation by IV. Natural choices for the instruments
are $\hat{\mathrm{E}}(w_{im} \mid x_i, z_i)$ and $\hat{\mathrm{E}}(w_{im} \mid x_i, z_i) \cdot (x_i - \bar x)$, where the $\hat{\mathrm{E}}(w_{im} \mid x_i, z_i)$ are obtained
from the specified distribution $D(w_{im} \mid x_i, z_i)$. In the simple binary case where $w_i$
follows a probit, we used the probit fitted values, $\hat\Phi_i$.

A test for whether all interactions between the treatment variables and
heterogeneity have zero coefficients is a standard, heteroskedasticity-robust Wald test of
$H_0: \rho_1 = \cdots = \rho_M = 0$ after IV estimation. If we want to allow some of the $\rho_m$ to be nonzero, the
variance matrix of all second-step estimators needs to be adjusted for the estimation
of the $\theta_m$, using either the delta method or bootstrapping. Notice that the coefficients
on the $w_{im}$ are the estimated counterfactual means, and then we can use these to
construct various average treatment effects. We also have direct estimates of how the
treatment effects vary with $x$ (because we have the $\hat\gamma_m$).

The correction function method is easy to apply when the marginal models for the
treatment indicators are easy to obtain. For example, it is much easier if, say, $w_{im}$
is assumed to follow a probit model than if a collection of indicators follows a
multivariate probit. The method applies also when some treatments have discrete and
continuous characteristics, such as hours spent in a job training program. If $w_{im}$
follows a Tobit model, then $h_m(x_i, z_i, \theta_m)$ is tractable; see Wooldridge (2008).


21.6.4 Panel Data


Using individual panel data with general patterns of treatments is complicated by the
different kinds of treatment effects, static and dynamic, that one might like to
consider. In this subsection, we consider the case where, in each time period $t$, unit $i$ is
part of a treatment or control group. Thus, $w_{it}$ is a binary variable where $w_{it} = 1$
means treatment in time $t$. Let $w_i = (w_{i1}, \ldots, w_{iT})$ denote the entire history of
treatment indicators. If we focus on the ATEs of treatment in a particular time period,
it is fairly straightforward to state ignorability assumptions conditional on
unobserved heterogeneity. Let $y_{it}(0)$ and $y_{it}(1)$ denote the counterfactual outcomes in the
untreated and treated states, respectively. Then ignorability conditional on
unobserved heterogeneity $c_i$ and covariates $x_i$ is

$$\mathrm{E}[y_{it}(g) \mid w_i, c_i, x_i] = \mathrm{E}[y_{it}(g) \mid c_i, x_i], \quad g = 0, 1. \tag{21.132}$$

Notice that this is a strict exogeneity assumption on treatment assignment,
conditional on $(c_i, x_i)$. Because $y_{it} = y_{it}(0) + w_{it}[y_{it}(1) - y_{it}(0)]$, it follows that

$$\begin{aligned}
\mathrm{E}(y_{it} \mid w_i, c_i, x_i) &= \mathrm{E}(y_{it} \mid w_{it}, c_i, x_i) \\
&= \mathrm{E}[y_{it}(0) \mid c_i, x_i] + w_{it}\,\mathrm{E}[y_{it}(1) - y_{it}(0) \mid c_i, x_i].
\end{aligned} \tag{21.133}$$


If the treatment effect $y_{it}(1) - y_{it}(0)$ is constant for each $t$, say $\tau_t$, and $\mathrm{E}[y_{it}(0) \mid c_i, x_i]$
is linear and additive with scalar heterogeneity, and only covariates at time $t$ appear,
then

$$\mathrm{E}(y_{it} \mid w_i, c_i, x_i) = c_{i0} + \alpha_{0t} + x_{it}\beta_{0t} + \tau_t w_{it}, \quad t = 1, \ldots, T,$$

and this leads to a standard fixed-effects or first-differencing analysis (especially if we
assume time homogeneity of $\beta_{0t}$ and $\tau_t$).

If we allow $\mathrm{E}[y_{it}(1) - y_{it}(0) \mid c_i, x_i]$ to depend on heterogeneity and covariates, we
get an estimating equation where $w_{it}$ interacts with $x_{it}$ as well as with heterogeneity,
say $a_i$. So, assuming time homogeneity except in the intercepts $\alpha_{0t}$ and the means of
the covariates, the estimating equation looks like

$$\mathrm{E}(y_{it} \mid w_i, c_i, x_i) = c_{i0} + \alpha_{0t} + x_{it}\beta_0 + a_i w_{it} + w_{it}(x_{it} - \psi_t)\delta, \quad t = 1, \ldots, T, \tag{21.134}$$

where $\psi_t = \mathrm{E}(x_{it})$ (which should be replaced by the sample average for each $t$ in
estimation) and $\delta$ is the coefficient vector on the interactions. The average treatment effect is $\tau = \mathrm{E}(a_i)$. We discussed how to estimate
such models in Section 11.7, and we also noted that there are conditions where setting
$a_i = \tau$ in estimation and using the usual FE estimator consistently estimates $\tau$.

If we are willing to assume strictly exogenous treatment, conditional on
heterogeneity and covariates, then allowing lagged effects of treatment is straightforward in a
regression-type framework. As a practical matter, one just adds lags of treatment
indicators to the standard models described previously. Then, one can estimate how
long program participation effects last. Explicitly using a counterfactual setting is
possible but notationally complicated. For example, suppose we are interested only
in contemporaneous effects and effects lagged a year. Then, for each $t \ge 2$, we have
counterfactual outcomes $y_{it}(g_{t-1}, g_t)$ where $g_{t-1}$ and $g_t$ can be zero or one,
corresponding to treatment one time period earlier or in the same time period. In other
words, at time $t$ there are four counterfactuals for unit $i$: $y_{it}(0,0)$, $y_{it}(1,0)$, $y_{it}(0,1)$,
and $y_{it}(1,1)$. We can write the observed outcome as

$$\begin{aligned}
y_{it} = {} & (1 - w_{i,t-1})(1 - w_{it})\, y_{it}(0,0) + w_{i,t-1}(1 - w_{it})\, y_{it}(1,0) \\
& + (1 - w_{i,t-1})\, w_{it}\, y_{it}(0,1) + w_{i,t-1} w_{it}\, y_{it}(1,1) \\
= {} & y_{it}(0,0) + w_{i,t-1}[y_{it}(1,0) - y_{it}(0,0)] + w_{it}[y_{it}(0,1) - y_{it}(0,0)] \\
& + w_{i,t-1} w_{it}[y_{it}(1,1) - y_{it}(1,0) - y_{it}(0,1) + y_{it}(0,0)].
\end{aligned}$$

If we modify assumption (21.132) to

$$\mathrm{E}[y_{it}(g_{t-1}, g_t) \mid w_i, c_i, x_i] = \mathrm{E}[y_{it}(g_{t-1}, g_t) \mid c_i, x_i], \quad g_{t-1}, g_t = 0, 1, \tag{21.135}$$

then unobserved-effects models fall out naturally. The standard additive model
with covariates dated at time $t$ (and possibly earlier time periods) emerges if only
$\mathrm{E}[y_{it}(0,0) \mid c_i, x_i]$ depends on heterogeneity and covariates, but adding interactions
with heterogeneity, and especially with observed covariates (contemporaneous or
lagged), is straightforward. If we treat all means as constant over time, $\mu_{gh} =
\mathrm{E}[y_{it}(g,h)]$, then it is straightforward to estimate various treatment effects. For
example, $\mu_{11} - \mu_{00}$ is the average treatment effect from participating in neither period
versus participating in both periods.

We can also adapt an approach due to Altonji and Matzkin (2005); see also
Wooldridge (2005a). Under assumption (21.135),

$$\begin{aligned}
\mathrm{E}(y_{it} \mid w_i, c_i, x_i) = {} & \mathrm{E}(y_{it} \mid w_{i,t-1}, w_{it}, c_i, x_i) = (1 - w_{i,t-1})(1 - w_{it})\,\mathrm{E}[y_{it}(0,0) \mid c_i, x_i] \\
& + w_{i,t-1}(1 - w_{it})\,\mathrm{E}[y_{it}(1,0) \mid c_i, x_i] \\
& + (1 - w_{i,t-1})\, w_{it}\,\mathrm{E}[y_{it}(0,1) \mid c_i, x_i] + w_{i,t-1} w_{it}\,\mathrm{E}[y_{it}(1,1) \mid c_i, x_i].
\end{aligned}$$

Unless we make special functional form assumptions of the kind just described, this
equation is not directly useful because it depends on the unobserved heterogeneity, $c_i$.
Suppose, however, that we assume that the distribution of $c_i$ given $(w_i, x_i)$ depends
on a relatively simple function of the history of treatments, such as the fraction of
treated periods, $\bar w_i$. (There are more sophisticated possibilities, such as using fractions
within various subperiods.) Formally, if $D(c_i \mid w_i, x_i) = D(c_i \mid \bar w_i, x_i)$, then, by iterated
expectations and assumption (21.135),

$$\begin{aligned}
\mathrm{E}(y_{it} \mid w_i, x_i) = {} & \mathrm{E}(y_{it} \mid w_{i,t-1}, w_{it}, \bar w_i, x_i) = (1 - w_{i,t-1})(1 - w_{it})\,\mathrm{E}[y_{it}(0,0) \mid \bar w_i, x_i] \\
& + w_{i,t-1}(1 - w_{it})\,\mathrm{E}[y_{it}(1,0) \mid \bar w_i, x_i] \\
& + (1 - w_{i,t-1})\, w_{it}\,\mathrm{E}[y_{it}(0,1) \mid \bar w_i, x_i] \\
& + w_{i,t-1} w_{it}\,\mathrm{E}[y_{it}(1,1) \mid \bar w_i, x_i].
\end{aligned} \tag{21.136}$$

Actually, we could have derived this equation directly if we had just assumed that
treatment at $t - 1$ and $t$ is unconfounded conditional on $(\bar w_i, x_i)$, but the current
derivation makes a link to approaches with unobserved heterogeneity.

It follows now from equation (21.136) that we can estimate $\mathrm{E}[y_{it}(g_{t-1}, g_t) \mid \bar w_i, x_i]$
by estimating $\mathrm{E}(y_{it} \mid w_{i,t-1} = g_{t-1}, w_{it} = g_t, \bar w_i, x_i) \equiv m_t(g_{t-1}, g_t, \bar w_i, x_i)$. That is, for
each of the four treatment group combinations, we estimate regressions of $y_{it}$ on
$(\bar w_i, x_i)$, or use quasi-MLE. (If $x_i$ is truly a sequence of covariates that change over
time, we might restrict the way $x_i$ appears, such as $(x_{it}, x_{i,t-1}, \bar x_i)$, but this restriction
is only intended to conserve on the dimension of the estimation problem.) Given the
$\hat m_t(g_{t-1}, g_t, \bar w_i, x_i)$, we have

$$\hat{\mathrm{E}}[y_{it}(g_{t-1}, g_t)] = N^{-1} \sum_{i=1}^{N} \hat m_t(g_{t-1}, g_t, \bar w_i, x_i); \tag{21.137}$$

that is, we average out the control variables $(\bar w_i, x_i)$. This approach can be made
quite flexible and allows estimation of average treatment effects that compare any of
the four groups to any other group. And, of course, nothing about the method
requires us to focus on only current and one lag of treatment status. The same
general approach applies to histories $g^t = (g^t_1, g^t_2, \ldots, g^t_t)$ where each $g^t_s$ is zero or one.
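
The averaging in (21.137) can be sketched for a single period $t$ as follows. The assumptions here go beyond the text: each cell-specific mean $m_t(\cdot)$ is taken to be linear in $(\bar w_i, x_i)$ and is fit by OLS, every one of the four cells is assumed to contain enough observations with variation in $\bar w_i$, and the function name `counterfactual_means` is hypothetical.

```python
import numpy as np

def counterfactual_means(y_t, w_lag, w_cur, w_bar, X):
    """Estimate E[y_it(g_{t-1}, g_t)] for the four (g_{t-1}, g_t) cells in one period t.

    Within each cell, y_it is regressed (OLS, an assumed linear mean) on
    1, w_bar_i, x_i; fitted values for all N units are then averaged.
    """
    y_t = np.asarray(y_t, dtype=float)
    w_lag, w_cur = np.asarray(w_lag), np.asarray(w_cur)
    R = np.column_stack([np.ones(len(y_t)), w_bar, X])
    out = {}
    for g_lag in (0, 1):
        for g_cur in (0, 1):
            cell = (w_lag == g_lag) & (w_cur == g_cur)
            b = np.linalg.lstsq(R[cell], y_t[cell], rcond=None)[0]
            out[(g_lag, g_cur)] = (R @ b).mean()   # average out (w_bar_i, x_i)
    return out
```

Differences such as `m[(1, 1)] - m[(0, 0)]` then estimate the average effect of being treated in both periods relative to neither. Note that with only two time periods $\bar w_i$ is constant within each cell, so the approach needs a longer treatment history for the within-cell regressions to be informative.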
The previous approaches assume ignorability of treatment conditional on hetero- 
geneity, or a sufficient statistic for the entire treatment history (such as the time 
average), and a sequence of observed covariates. It may be more realistic to 
assume ignorability conditional on past observed outcomes, treatment assignments, 
and covariates. For example, if workers are being assigned into job training, an 
administrator may be more likely to make assignments on the basis of past observed 
labor market outcomes and past assignment status. General frameworks can be 
found in Gill and Robins (2001), Lechner and Miquel (2005), and Abbring and 
Heckman (2007). Here I follow Lechner (2004) and use a dynamic ignorability 
assumption. The setup is that for a total of $T$ time periods we have a balanced panel.
For each $i$, we observe the sequence of binary treatment indicators, $\{w_{it} : t = 1, \ldots, T\}$.
Let the $1 \times t$ vector $g^t$ be defined as in the previous paragraph: a sequence
of zeros and ones indicating a treatment regime through time $t$. We are interested
in the counterfactual outcomes $y_{it}(g^t)$, the outcome under regime $g^t$. (Incidentally,
unlike in some approaches to treatment effect estimation with panel data, we do not
have to consider counterfactual outcomes as a function of future treatments. Therefore,
we need not put restrictions on the counterfactuals such as the “no anticipation”
requirement; see, for example, Abbring and Heckman (2007).) Naturally, we can
write the observed outcome, $y_{it}$, as a function of the $y_{it}(g^t)$ for $g^t \in \mathcal{G}^t$, the set of all
valid treatment regimes. Further, at each time $t$ we observe a set of covariates, $x_i^t$,
such that $x_i^{r-1} \subset x_i^r$, $r = 2, \ldots, T$. At a minimum, $x_i^t$ would typically include all past
observed outcomes, $\{y_{i,t-1}, \ldots, y_{i1}\}$, but it can also include time-constant variables,
say, $z_i$, collected in the initial period. In some cases, $x_i^t$ can contain variables dated
at time $t$, but one must be sure that such variables are not themselves affected by
treatment.
We state the dynamic ignorability assumption (for each $t$) as

$$D[w_{ir} \mid y_{it}(g^t), w_{i,r-1}, \ldots, w_{i1}, x_i^r] = D[w_{ir} \mid w_{i,r-1}, \ldots, w_{i1}, x_i^r], \quad g^t \in \mathcal{G}^t, \ r \le t. \qquad (21.138)$$

When $r = t$, this condition is

$$D[w_{it} \mid y_{it}(g^t), w_{i,t-1}, \ldots, w_{i1}, x_i^t] = D[w_{it} \mid w_{i,t-1}, \ldots, w_{i1}, x_i^t], \quad g^t \in \mathcal{G}^t,$$

which says that, conditional on past assignments and outcomes (contained in $x_i^t$),
assignment is independent of the counterfactual outcomes. Lechner (2004) refers to
this as the weak dynamic conditional independence assumption. Equation (21.138)
requires that ignorability of assignment with respect to the counterfactual at time $t$
hold in periods before $t$.

With a suitable overlap assumption, which will become clear shortly, we can show
that equation (21.138) is sufficient to identify the means $E[y_{it}(g^t)]$, which we can
compare across different $g^t$ to obtain average treatment effects. To simplify the
notation, consider the mean $E[y_{it}(1^t)]$ where $1^t = (1, 1, \ldots, 1)$, a $t$-vector of ones.
This corresponds to treatment in every time period. Define

$$p_{ir}(x_i^r) \equiv P(w_{ir} = 1 \mid w_{i,r-1} = 1, \ldots, w_{i1} = 1, x_i^r), \quad r \le t \qquad (21.139)$$


(which is specific to the particular treatment sequence we are considering). By
assumption (21.138), this is also the probability conditional on $y_{it}(1^t)$. Now, we want
to show that a particular kind of inverse probability weighting identifies $E[y_{it}(1^t)]$.
(This is similar to the attrition problem in Section 19.9.3 except that here we assume
that the same units are observed in each time period. Thus, though there is a “missing
data” problem in that we do not observe $y_{it}(1^t)$, we assume that, at a given time
period $t$, for each unit we observe the entire history of past outcomes and other
covariates.) The selected, weighted outcome $y_{it}$ for $g^t = (1, \ldots, 1)$ is

$$\frac{w_{it} w_{i,t-1} \cdots w_{i1}\, y_{it}}{p_{it}(x_i^t) \cdots p_{i1}(x_i^1)} = \frac{w_{it} w_{i,t-1} \cdots w_{i1}\, y_{it}(1^t)}{p_{it}(x_i^t) \cdots p_{i1}(x_i^1)}, \qquad (21.140)$$

where the equality follows because $w_{it} w_{i,t-1} \cdots w_{i1} = 1$ implies that $y_{it} = y_{it}(1^t)$.
Because we have divided by the propensity scores in equation (21.140), we are, of
course, assuming they are nonzero. This is the overlap assumption for this particular
counterfactual outcome. If, for example, it is impossible to receive treatment in
every period, then $E[y_{it}(1^t)]$ is generally unidentified. Essentially, we must restrict
attention to averages of counterfactual outcomes that correspond to possible treatment
sequences.

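We now show that the expected value of the second term is $E[y_{it}(1^t)]$. We do so
sequentially by iterated expectations and assumption (21.138). First,

$$E\left[\frac{w_{it} w_{i,t-1} \cdots w_{i1}\, y_{it}(1^t)}{p_{it}(x_i^t) \cdots p_{i1}(x_i^1)}\right] = E\left\{ E\left[\frac{w_{it} w_{i,t-1} \cdots w_{i1}\, y_{it}(1^t)}{p_{it}(x_i^t) \cdots p_{i1}(x_i^1)} \,\middle|\, y_{it}(1^t), w_{i,t-1}, \ldots, w_{i1}, x_i^t \right] \right\}.$$

Now, because $x_i^{r-1} \subset x_i^r$, the denominator is a function of the conditioning variables,
and the conditional expectation of the numerator is

$$
\begin{aligned}
& E[w_{it} \mid y_{it}(1^t), w_{i,t-1}, \ldots, w_{i1}, x_i^t]\, w_{i,t-1} \cdots w_{i1}\, y_{it}(1^t) \\
&\quad = P(w_{it} = 1 \mid w_{i,t-1}, \ldots, w_{i1}, x_i^t)\, w_{i,t-1} \cdots w_{i1}\, y_{it}(1^t) \\
&\quad = P(w_{it} = 1 \mid w_{i,t-1} = 1, \ldots, w_{i1} = 1, x_i^t)\, w_{i,t-1} \cdots w_{i1}\, y_{it}(1^t) \\
&\quad = p_{it}(x_i^t)\, w_{i,t-1} \cdots w_{i1}\, y_{it}(1^t),
\end{aligned}
$$

where the first equality follows from assumption (21.138) and the second follows
because the term is zero unless all $w_{is}$, $s = 1, \ldots, t-1$, are unity. Therefore, we have
shown that

$$E\left[\frac{w_{it} w_{i,t-1} \cdots w_{i1}\, y_{it}(1^t)}{p_{it}(x_i^t) \cdots p_{i1}(x_i^1)} \,\middle|\, y_{it}(1^t), w_{i,t-1}, \ldots, w_{i1}, x_i^t \right] = \frac{w_{i,t-1} \cdots w_{i1}\, y_{it}(1^t)}{p_{i,t-1}(x_i^{t-1}) \cdots p_{i1}(x_i^1)},$$

and so

$$E\left[\frac{w_{it} w_{i,t-1} \cdots w_{i1}\, y_{it}(1^t)}{p_{it}(x_i^t) \cdots p_{i1}(x_i^1)}\right] = E\left[\frac{w_{i,t-1} \cdots w_{i1}\, y_{it}(1^t)}{p_{i,t-1}(x_i^{t-1}) \cdots p_{i1}(x_i^1)}\right]. \qquad (21.141)$$

Now we can repeat the argument conditioning on $[y_{it}(1^t), w_{i,t-2}, \ldots, w_{i1}, x_i^{t-1}]$ and
use assumption (21.138) for $r = t - 1$. And so on, until we get the expectation to

$$E\left[\frac{w_{i1}\, y_{it}(1^t)}{p_{i1}(x_i^1)}\right] = E[y_{it}(1^t)],$$

where one final application of assumption (21.138) and iterated expectations gets us
the desired equality.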
Given that

$$E\left[\frac{w_{it} w_{i,t-1} \cdots w_{i1}\, y_{it}}{p_{it}(x_i^t) \cdots p_{i1}(x_i^1)}\right] = E[y_{it}(1^t)], \qquad (21.142)$$

where all quantities in the left-hand-side expectation are either observed or estimable,
we can consistently estimate $E[y_{it}(1^t)]$ by

$$\hat{E}[y_{it}(1^t)] = N^{-1} \sum_{i=1}^{N} \frac{w_{it} w_{i,t-1} \cdots w_{i1}\, y_{it}}{\hat{p}_{it}(x_i^t) \cdots \hat{p}_{i1}(x_i^1)}, \qquad (21.143)$$

where $\hat{p}_{ir}(x_i^r)$ is an estimate of $P(w_{ir} = 1 \mid w_{i,r-1} = 1, \ldots, w_{i1} = 1, x_i^r)$. We can estimate
these probabilities very generally. With lots of data, we might estimate a flexible
binary response model for $w_{ir}$ on the subsample with $w_{i,r-1} = 1, \ldots, w_{i1} = 1$ with $x_i^r$
as the covariates. Or, we might estimate a model for $P(w_{ir} = 1 \mid w_{i,r-1}, \ldots, w_{i1}, x_i^r)$
using all $i$, and then insert $w_{i,r-1} = 1, \ldots, w_{i1} = 1$.

The approach for any sequence of potential treatments $g^t$ should be clear. We
use the treatment indicators to select out the appropriate subsample, and then estimate
$P(w_{ir} = g_r \mid w_{i,r-1} = g_{r-1}, \ldots, w_{i1} = g_1, x_i^r)$ for $r = 1, \ldots, t$. For example, with
$g^t = (0, 0, \ldots, 0)$ we select out the subsample using $(1 - w_{it}) \cdots (1 - w_{i1})$ and then
weight by the inverse of the product of the conditional probabilities. By estimating
several combinations of treatment patterns we can obtain distributed lag effects of the
policy intervention on current and future outcomes.
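
The weighting estimator in equation (21.143) can be illustrated with a small simulation for $T = 2$ and the regime $g^2 = (1, 1)$. This is only a sketch under stated assumptions: the data-generating process, the function name `simulate_dynamic_ipw`, and every parameter value are hypothetical and not taken from the text. To keep it short, the true propensity scores are used in place of estimated ones; in practice they would be estimated, for example by a logit on the relevant subsample.

```python
import math
import random


def sigmoid(a):
    """Logistic function."""
    return 1.0 / (1.0 + math.exp(-a))


def simulate_dynamic_ipw(n=50_000, seed=0):
    """Return (IPW estimate, naive subsample mean) for E[y_2(1, 1)].

    Assumed design (illustrative only):
      x1 ~ N(0, 1);  P(w1 = 1 | x1) = sigmoid(0.5 * x1)
      y1(g1) = x1 + g1 + e1
      P(w2 = 1 | w1, x1, y1) = sigmoid(0.3 + 0.8 * y1)
      y2(g1, g2) = 1 + g2 + 0.5 * x1 + 0.6 * y1(g1) + e2
    so that E[y2(1, 1)] = 1 + 1 + 0.6 * 1 = 2.6.
    """
    rng = random.Random(seed)
    ipw_sum = 0.0
    naive_sum, naive_n = 0.0, 0
    for _ in range(n):
        x1 = rng.gauss(0.0, 1.0)
        p1 = sigmoid(0.5 * x1)                # period-1 propensity score
        w1 = 1 if rng.random() < p1 else 0
        y1 = x1 + 1.0 * w1 + rng.gauss(0.0, 1.0)
        p2 = sigmoid(0.3 + 0.8 * y1)          # depends on the past outcome
        w2 = 1 if rng.random() < p2 else 0
        y2 = 1.0 + 1.0 * w2 + 0.5 * x1 + 0.6 * y1 + rng.gauss(0.0, 1.0)
        if w1 == 1 and w2 == 1:               # selection indicator w2 * w1
            ipw_sum += y2 / (p2 * p1)         # weight by inverse product
            naive_sum += y2
            naive_n += 1
    # The IPW sum is divided by the full sample size n, as in (21.143).
    return ipw_sum / n, naive_sum / naive_n


if __name__ == "__main__":
    ipw, naive = simulate_dynamic_ipw()
    print(f"IPW estimate: {ipw:.3f}  naive mean: {naive:.3f}  truth: 2.600")
```

In this design units with high $x_{i1}$ and high $y_{i1}$ are more likely to be treated in both periods, so the unweighted mean among always-treated units overstates $E[y_{i2}(1,1)]$, while the weighted average should be close to it.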

The two approaches just described—ignorability conditional on unobserved het- 
erogeneity and dynamic ignorability—are generally consistent under different as- 
sumptions. Nevertheless, it is sometimes (at least implicitly) claimed that dynamic 
ignorability is less restrictive. To see that this claim can seem to be true but is not, it is 
useful to work through a simple case. 

Suppose that $T = 2$ and there is no treatment in the first time period, so $w_{i1} = 0$.
Consequently, the observed outcome in period one is $y_{i1}$, and there is no need
to distinguish between it and counterfactual outcomes. In period 2 the counterfactual
outcomes are $y_{i2}(0)$ and $y_{i2}(1)$, and the observed outcome is $y_{i2} = y_{i2}(0) +
w_{i2}[y_{i2}(1) - y_{i2}(0)]$. Assume that the treatment effect is the constant $\tau$. If we assume
ignorability conditional on heterogeneity $c_i$, then we can write $y_{i1} = c_i + u_{i1}$, $y_{i2} =
c_i + \gamma + \tau w_{i2} + u_{i2}$, where $\gamma$ is the difference in intercepts between the two time
periods (assumed to be constant). These simple representations lead to a simple
differenced equation:

$$\Delta y_{i2} = \gamma + \tau w_{i2} + \Delta u_{i2}, \qquad (21.144)$$

where $\Delta y_{i2} = y_{i2} - y_{i1}$ and we use the fact that $w_{i1} = 0$. If we estimate equation
(21.144) by OLS, the OLS estimate is the first-difference (FD) estimate,

$$\hat{\tau}_{FD} = \overline{\Delta y}_{treat} - \overline{\Delta y}_{control}, \qquad (21.145)$$

which is the difference in average changes over time between the treatment and control
groups. This estimator is consistent because treatment is strictly exogenous
conditional on the heterogeneity.

A commonly used alternative in the statistics literature is to add the first-period
outcome as a control, that is, use the regression

$$\Delta y_{i2} \text{ on } 1,\ w_{i2},\ y_{i1}. \qquad (21.146)$$

If we add the assumption that the shocks $\{u_{i1}, u_{i2}\}$ are serially uncorrelated
(specifically, $E(u_{i1} u_{i2} \mid w_{i2}, c_i) = 0$), then adding $y_{i1}$ overcontrols and leads to
inconsistency. In fact, it can be shown (for example, Angrist and Pischke (2009), Section
5.4) that

$$\operatorname{plim}(\hat{\tau}_{LDV}) = \tau + \pi_1 (\sigma_{u_1}^2 / \sigma_{r_2}^2), \qquad (21.147)$$

where $w_{i2} = \pi_0 + \pi_1 y_{i1} + r_{i2}$ is a linear projection, and so $\pi_1 = \operatorname{Cov}(c_i, w_{i2}) /
(\sigma_c^2 + \sigma_{u_1}^2)$. We can easily sign the bias given the sign of $\pi_1$. For example, if $w_{i2}$
indicates a job training program and less productive workers are more likely to
participate ($\pi_1 < 0$), then the regression that controls for $y_{i1}$ underestimates the job
training effect. If more productive workers participate, it overestimates the effect of
job training. This simple example illustrates a general lesson in estimating treatment
effects, whether one uses regression, propensity score weighting, or matching: it is not
necessarily true that controlling for more covariates results in less bias. This example
shows that adding covariates can actually induce bias when there was none.
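
A short simulation can make the overcontrol result in equation (21.147) concrete. The sketch below is illustrative only: the design (standard normal $c_i$, $u_{i1}$, $u_{i2}$, and participation when $c_i$ plus noise is negative), the function names, and the parameter values are assumptions, not part of the text. It compares the FD estimate, the estimate that also controls for $y_{i1}$, and the value predicted by (21.147) using sample moments and the known $\sigma_{u_1}^2 = 1$.

```python
import random


def coef_on_w(d, w, y):
    """OLS coefficient on w from regressing d on (1, w, y), plus moments."""
    n = len(d)
    md, mw, my = sum(d) / n, sum(w) / n, sum(y) / n
    s_ww = sum((a - mw) ** 2 for a in w) / n
    s_yy = sum((b - my) ** 2 for b in y) / n
    s_wy = sum((a - mw) * (b - my) for a, b in zip(w, y)) / n
    s_wd = sum((a - mw) * (e - md) for a, e in zip(w, d)) / n
    s_yd = sum((b - my) * (e - md) for b, e in zip(y, d)) / n
    # Standard two-regressor formula (partialling out the intercept).
    coef = (s_wd * s_yy - s_yd * s_wy) / (s_ww * s_yy - s_wy ** 2)
    return coef, s_ww, s_yy, s_wy, s_wd


def fd_versus_ldv(n=100_000, tau=2.0, gamma=1.0, seed=0):
    """Return (FD estimate, LDV estimate, LDV plim implied by (21.147))."""
    rng = random.Random(seed)
    dy, w, y1s = [], [], []
    for _ in range(n):
        c = rng.gauss(0.0, 1.0)
        u1, u2 = rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)
        # Less productive workers (low c) are more likely to participate.
        w2 = 1.0 if c + rng.gauss(0.0, 1.0) < 0 else 0.0
        y1 = c + u1
        y2 = c + gamma + tau * w2 + u2
        dy.append(y2 - y1)
        w.append(w2)
        y1s.append(y1)
    ldv, s_ww, s_yy, s_wy, s_wd = coef_on_w(dy, w, y1s)
    fd = s_wd / s_ww                      # difference in mean changes
    pi1 = s_wy / s_yy                     # projection of w2 on y1
    var_r2 = s_ww - pi1 ** 2 * s_yy       # variance of projection error
    predicted_ldv = tau + pi1 * 1.0 / var_r2   # sigma_u1^2 = 1 by design
    return fd, ldv, predicted_ldv


if __name__ == "__main__":
    print(fd_versus_ldv())
```

With these assumed values the FD estimate should be close to $\tau = 2$, while the regression that controls for $y_{i1}$ falls well below it, in line with the sign of $\pi_1$.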

Now suppose that ignorability of treatment holds conditional on $y_{i1}$ (and the
treatment effect is constant). Then we can write

$$\Delta y_{i2} = \gamma + \tau w_{i2} + \lambda y_{i1} + e_{i2}$$
$$E(e_{i2}) = 0, \quad \operatorname{Cov}(w_{i2}, e_{i2}) = \operatorname{Cov}(y_{i1}, e_{i2}) = 0.$$

Now, of course, controlling for $y_{i1}$ consistently estimates $\tau$ because we are assuming
that controlling for $y_{i1}$ (in a linear way) leads to ignorable treatment. On the other
hand, the FD estimator suffers from omitted variable “bias”:

$$\operatorname{plim}(\hat{\tau}_{FD}) = \tau + \lambda \frac{\operatorname{Cov}(w_{i2}, y_{i1})}{\operatorname{Var}(w_{i2})}. \qquad (21.148)$$

Suppose there is regression to the mean, so that $\lambda < 0$. If workers observed with low
first-period earnings are more likely to participate, so that $\operatorname{Cov}(w_{i2}, y_{i1}) < 0$, then
$\operatorname{plim}(\hat{\tau}_{FD}) > \tau$, and so FD overestimates the effect. In fact, if the correlation between
$w_{i2}$ and $y_{i1}$ is negative in the sample, $\hat{\tau}_{FD} > \hat{\tau}_{LDV}$ is an algebraic fact. But this does
not allow us to determine which estimator has less inconsistency because the derivations
rely on different ignorability assumptions.
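
The reverse case, equation (21.148), can be checked the same way. In this sketch the data are generated so that ignorability holds conditional on $y_{i1}$ with $\lambda = -0.5$ (regression to the mean) and low first-period outcomes raise the participation probability. The design, names, and parameter values are again illustrative assumptions rather than anything specified in the text.

```python
import random


def two_estimators(dy, w, y1):
    """Return (FD estimate, LDV estimate, s_wy / s_ww) from sample moments."""
    n = len(dy)
    md, mw, my = sum(dy) / n, sum(w) / n, sum(y1) / n
    s_ww = sum((a - mw) ** 2 for a in w) / n
    s_yy = sum((b - my) ** 2 for b in y1) / n
    s_wy = sum((a - mw) * (b - my) for a, b in zip(w, y1)) / n
    s_wd = sum((a - mw) * (e - md) for a, e in zip(w, dy)) / n
    s_yd = sum((b - my) * (e - md) for b, e in zip(y1, dy)) / n
    fd = s_wd / s_ww
    ldv = (s_wd * s_yy - s_yd * s_wy) / (s_ww * s_yy - s_wy ** 2)
    return fd, ldv, s_wy / s_ww


def ldv_ignorability(n=100_000, tau=2.0, gamma=1.0, lam=-0.5, seed=0):
    """Simulate Delta y = gamma + tau*w + lam*y1 + e with w depending on y1."""
    rng = random.Random(seed)
    dy, w, y1s = [], [], []
    for _ in range(n):
        y1 = rng.gauss(0.0, 1.0)
        # Low first-period outcomes make participation more likely.
        w2 = 1.0 if y1 + rng.gauss(0.0, 1.0) < 0 else 0.0
        dy.append(gamma + tau * w2 + lam * y1 + rng.gauss(0.0, 1.0))
        w.append(w2)
        y1s.append(y1)
    fd, ldv, ratio = two_estimators(dy, w, y1s)
    predicted_fd = tau + lam * ratio      # sample analog of (21.148)
    return fd, ldv, predicted_fd


if __name__ == "__main__":
    print(ldv_ignorability())
```

Here the regression that controls for $y_{i1}$ should recover $\tau$, and the FD estimate should exceed it by roughly $\lambda \operatorname{Cov}(w_{i2}, y_{i1}) / \operatorname{Var}(w_{i2})$. Which simulation resembles a given application is exactly what the data cannot reveal, as the paragraph above notes.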


Problems 


21.1. Consider the difference-in-means estimator, $\hat{d} = \bar{y}_1 - \bar{y}_0$, where $\bar{y}_g$ is the
sample average of the $y_i$ with $w_i = g$, $g = 0, 1$.

a. Show that, as an estimator of $\tau_{att}$, the bias in $\bar{y}_1 - \bar{y}_0$ is $E(y_0 \mid w = 1) -
E(y_0 \mid w = 0)$.

b. Let $y_0$ be the earnings someone would earn in the absence of job training, and let
$w = 1$ denote the job training indicator. Explain the meaning of $E(y_0 \mid w = 1) <
E(y_0 \mid w = 0)$. Intuitively, does it make sense that $E(\hat{d}) < \tau_{att}$?


21.2. Explain how you would estimate $\tau_{ate,\mathcal{R}} = E(y_1 - y_0 \mid x \in \mathcal{R})$ using propensity
score weighting under Assumption ATE.1’.


21.3. Use the data in JTRAIN3.RAW to answer these questions. The response in 
this case is the binary variable unem78. 


a. Estimate $\tau_{ate}$ from the simple regression of $unem78_i$ on 1, $train_i$. Does the training
program have the anticipated effect on being unemployed?

b. Add as controls the variables age, educ, black, hisp, married, re74, re75, unem75,
and unem74. How does the estimate of $\tau_{ate}$ compare with the simple regression
estimate from part a? What is the (heteroskedasticity-robust) 95% confidence interval?

c. Using the same controls as in part b, estimate separate regressions for the treatment
and control groups. Now what is $\hat{\tau}_{ate}$? Does it differ much from the estimate in
part b? What is $\hat{\tau}_{att}$?

d. Now use regression adjustment with controls age, educ, black, hisp, married, re74,
and re75, but use only the subsample of men who were unemployed in 1974 or 1975.
Again use two separate regression functions. What are $\hat{\tau}_{ate}$ and $\hat{\tau}_{att}$? How do these
compare with the estimates from part c?

e. Estimate the propensity score using logit and the explanatory variables in part b.
How many outcomes are perfectly predicted? What does this result mean for IPW
estimation?

f. Now estimate the propensity score using only observations with avgre < 15. Use
the $\hat{p}(x_i)$ in IPW estimation to obtain $\hat{\tau}_{ate}$ and $\hat{\tau}_{att}$. Be sure to only use observations
with avgre < 15. Compare these with the corresponding regression adjustment
estimates using the avgre < 15 observations.


21.4. Carefully derive equation (21.80). 
21.5. Use the data in JTRAIN2.RAW for this question. 


a. As in Example 21.2, run a probit of train on 1, $x$, where $x$ contains the covariates
from Example 21.2. Obtain the probit fitted values, say $\hat{\Phi}_i$.

b. Estimate the equation $re78_i = \gamma_0 + \tau\, train_i + x_i \gamma + u_i$ by IV, using instruments
$(1, \hat{\Phi}_i, x_i)$. Comment on the estimate of $\tau$ and its standard error.

c. Regress $\hat{\Phi}_i$ on $x_i$ to obtain the R-squared. What do you make of this result?

d. Does the nonlinearity of the probit model for train allow us to estimate $\tau$ when we
do not have an additional instrument? Explain.


21.6. In Procedure 21.2, explain why it is better to estimate equation (21.71) by IV
rather than to run the OLS regression $y_i$ on 1, $\hat{G}_i$, $x_i$, $\hat{G}_i(x_i - \bar{x})$, $i = 1, \ldots, N$.


21.7. As a special case of the setup in Section 21.6.3 we can write $y_i = a_i + b_i w_i$
where $w_i$ is a scalar. If we define $\beta = E(b_i)$, then, under assumptions (21.124) and
(21.125),

$$\beta = E\left\{ \frac{[w_i - \psi(x_i)]\, y_i}{\omega(x_i)} \right\},$$

where $\psi(x_i) = E(w_i \mid x_i)$ and $\omega(x_i) = \operatorname{Var}(w_i \mid x_i)$.

a. Suppose that the treatment variable $w_i$ is a nonnegative count variable (such as
visits to one’s family physician during a year) and you think $E(w \mid x) = \exp(\gamma_0 + x\gamma)$
and $\operatorname{Var}(w \mid x) = \exp(\delta_0 + x\delta)$. Propose a method for estimating the mean and variance
parameters. [Hint: For the latter, define appropriate errors, say $r = w - E(w \mid x)$,
and note that $E(r^2 \mid x) = \operatorname{Var}(w \mid x)$.]

b. Given the estimators from part a, suggest a consistent, $\sqrt{N}$-asymptotically normal
estimator of $\beta$. How would you conduct inference on $\beta$?

c. Apply the method from parts a and b to the data in JTRAIN2.RAW, using as
$w$ the variable mostrn (months spent in job training) and $y = re78$. Use the same
variables in $x$ as Example 21.1.

d. Now suppose $w_i$ is a proportion (such as proportion of lectures attended) and you
think $E(w \mid x) = \Lambda(\gamma_0 + x\gamma)$ and $\operatorname{Var}(w \mid x) = \delta_0 + \delta_1 E(w \mid x) + \delta_2 [E(w \mid x)]^2$, where
$\Lambda(\cdot)$ is the logistic function. Now how would you estimate the mean and variance
parameters? What practical problem might arise with the estimated variances?

e. Apply the approach from part d to the data set ATTEND.RAW with $w = atndrte$,
the proportion (not percentage) of lectures attended. For the response use $y = stndfnl$,
for the elements of $x$ use cubics in priGPA and ACT, and use the frosh and soph
binary indicators. What is the estimate of $\beta$? How does it compare to the multiple
regression estimate of stndfnl on atndrte and the given controls?


21.8. In the IV setup of Section 21.6.3, suppose that $b = \beta$, and therefore we can
write

$$y = a + \beta w + e, \quad E(e \mid a, x, z) = 0.$$

Assume that conditions (21.129) and (21.130) hold for $a$.

a. Suppose $w$ is a corner solution outcome, such as hours spent in a job training
program. If $z$ is used as IVs for $w$ in $y = \gamma_0 + \beta w + x\gamma + r$, what is the identification
condition?

b. If $w$ given $(x, z)$ follows a standard Tobit model, propose an IV estimator that
uses the Tobit fitted values for $w$.

c. If $\operatorname{Var}(e \mid a, x, z) = \sigma_e^2$ and $\operatorname{Var}(a \mid x, z) = \sigma_a^2$, argue that the IV estimator from part
b is asymptotically efficient.

d. What is an alternative to IV estimation that would use the Tobit fitted values for
$w$? Which method do you prefer?

e. If $b \ne \beta$, but assumptions (21.129) and (21.130) hold, how would you estimate $\beta$?


21.9. a. Using the data in JTRAIN3.RAW, estimate the logit model for the propensity
score in Example 21.1, using the same explanatory variables described there.
Plot histograms for the $train_i = 1$ and $train_i = 0$ samples separately. What do you
conclude about the overlap assumption?

b. Now use the subsample with $(re74 + re75)/2 < 15$, and repeat the exercise from
part a. How does the situation compare with the full data set?


21.10. In this problem you are to show that one can improve efficiency in estimating
a constant ATE under random assignment by including regressors (provided those
regressors are also independent of assignment). Let $w_i$ be the treatment, and assume
that $w_i$ is independent of $(y_{i0}, y_{i1}, x_i)$. Assume that $y_{i1} - y_{i0} = \tau$ for all $i$. Thus,
write

$$y_i = y_{i0} + \tau w_i \equiv \mu_0 + \tau w_i + v_{i0}.$$

a. Derive the asymptotic variance of the simple OLS estimator $\tilde{\tau}$ from the regression
of $y_i$ on 1, $w_i$ in terms of $\operatorname{Var}(v_{i0})$, $\rho = P(w_i = 1)$, and $N$.

b. Write the linear projection of $y_{i0}$ on $(1, x_i)$ as

$$y_{i0} = \alpha_0 + x_i \beta_0 + u_{i0}$$
$$E(u_{i0}) = 0, \quad E(x_i' u_{i0}) = 0.$$

Show that we can write

$$y_i = \alpha_0 + \tau w_i + x_i \beta_0 + u_{i0},$$

and explain why $w_i$ is uncorrelated with $x_i$ and $u_{i0}$.

c. Using part b, show that the asymptotic variance of $\hat{\tau}$ from the regression $y_i$ on 1,
$w_i$, $x_i$ is $\operatorname{Var}(u_{i0}) / [N\rho(1 - \rho)]$.

d. Show that if $\beta_0 \ne 0$, $\operatorname{Var}(u_{i0}) < \operatorname{Var}(v_{i0})$. Conclude that the asymptotic variance
of $\hat{\tau}$ is strictly smaller than that of $\tilde{\tau}$ if $x_i$ helps to predict $y_{i0}$.

e. Suppose that $E(y_{i0} \mid x_i) \ne \alpha_0 + x_i \beta_0$, that is, the linear projection differs from the
conditional mean. In this situation, why might you use the simple difference-in-means
estimator under random assignment? (Hint: Think about small-sample versus large-sample
properties.)


21.11. Suppose that we allow full slope, as well as intercept, heterogeneity in a
linear representation of two counterfactual outcomes,

$$y_{i0} = a_{i0} + x_i b_{i0}$$
$$y_{i1} = a_{i1} + x_i b_{i1}.$$

Assume that the vector $(x_i, z_i)$ is independent of $(a_{i0}, b_{i0}, a_{i1}, b_{i1})$, which makes, as
we will see, $z_i$ instrumental variables candidates in a control function or correction
function setting.

a. Define $\alpha_g = E(a_{ig})$ and $\beta_g = E(b_{ig})$, $g = 0, 1$, and $\psi = E(x_i)$. Find $\mu_g = E(y_{ig})$,
$g = 0, 1$, in terms of these population parameters.

b. Let $\tau = \tau_{ate} = \mu_1 - \mu_0$ be the ATE. Also, write $a_{ig} = \alpha_g + c_{ig}$ and $b_{ig} = \beta_g + f_{ig}$.
Show that

$$y_i = \mu_0 + \tau w_i + (x_i - \psi)\beta_0 + w_i (x_i - \psi)\delta + c_{i0} + x_i f_{i0} + w_i e_i + w_i x_i d_i, \qquad (21.149)$$

where $\delta = \beta_1 - \beta_0$, $e_i = c_{i1} - c_{i0}$, and $d_i = f_{i1} - f_{i0}$.

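c. Assume that

$$w_i = 1[\theta_0 + x_i \theta_1 + z_i \theta_2 + v_i \ge 0] \equiv 1[q_i \theta + v_i \ge 0],$$

and assume that $E(c_{i0} \mid v_i, x_i, z_i) = \rho_0 v_i$, $E(f_{i0} \mid v_i, x_i, z_i) = \eta_0 v_i$, $E(e_i \mid v_i, x_i, z_i) = \xi v_i$,
and $E(d_i \mid v_i, x_i, z_i) = \omega v_i$. Find $E(y_i \mid v_i, x_i, z_i)$.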

d. Assuming that $v_i$ is independent of $(x_i, z_i)$ and has a standard normal distribution,
use part c to find $E(y_i \mid w_i, x_i, z_i)$. (Hint: This should depend on the generalized
residual $\lambda(w_i, q_i \theta) \equiv w_i \lambda(q_i \theta) - (1 - w_i) \lambda(-q_i \theta)$, where $\lambda(\cdot)$ is the inverse Mills ratio.)

e. Propose a two-step control function approach to estimating $\tau$. How does this differ
from the method derived in Section 21.4.2?

f. How would you obtain a valid standard error for $\hat{\tau}$ from part e?

g. How would you estimate $\tau_{ate}(x) = E(y_1 - y_0 \mid x)$?


21.12. Consider the same setup as Problem 21.11, under the same assumptions. 

a. Of the four unobserved terms in equation (21.149), which two have a zero mean 
conditional on (x;, z;)? 

b. Derive the correction functions for the other terms. 

c. Propose a two-step correction function estimator of $\tau$ (and the other parameters).

d. How would you test whether the correction functions are needed? Be very precise.


21.13. For the sharp regression discontinuity design, what kind of local estimation
method might you use if $y_i$ is a fractional response? Specifically describe the model
and estimation method, and provide the formula for $\hat{\tau}_c$.


21.14. Consider a treatment effect setup with a binary treatment, $w_{it}$, in each time
period. We are interested in contemporaneous ATEs, $\tau_{t,ate} = E[y_{it}(1) - y_{it}(0)] =
\mu_{t1} - \mu_{t0}$. Assume that for covariates $x_{it}$, $y_{it}(g) = a_{itg} + x_{it} \beta_g$, $g = 0, 1$, where $a_{itg}$ are
random variables and $\beta_g$ is assumed (for simplicity) to be constant over time. Write
$a_{itg} = \alpha_{tg} + c_{itg}$, where $\alpha_{tg} = E(a_{itg})$. Define $\psi_t = E(x_{it})$.

a. Find $\mu_{t0}$, $\mu_{t1}$, and $\tau_{t,ate}$ in terms of the $\alpha_{tg}$, $\beta_g$, and $\psi_t$.

b. Let $y_{it} = (1 - w_{it}) y_{it}(0) + w_{it} y_{it}(1)$ and let $\tau_t = \tau_{t,ate}$ for notational simplicity.
Show that we can write

$$y_{it} = \mu_{t0} + \tau_t w_{it} + (x_{it} - \psi_t)\beta_0 + w_{it}(x_{it} - \psi_t)\delta + c_{it0} + w_{it} e_{it}, \qquad (21.150)$$

where $\delta = \beta_1 - \beta_0$ and $e_{it} = c_{it1} - c_{it0}$.

c. We think in equation (21.150) that $w_{it}$ is correlated with $c_{it0}$ and $e_{it}$, possibly
because these terms each contain time-constant heterogeneity and idiosyncratic
unobservables that are related to treatment. Suppose we have a vector $z_{it}$ of potential
instrumental variables. We allow the time average of the instruments and covariates
to be correlated with $(c_{it0}, e_{it})$. In particular,


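$$c_{it0} = (\bar{x}_i - \mu_{\bar{x}})\xi_1 + (\bar{z}_i - \mu_{\bar{z}})\xi_2 + r_{it0}, \quad E(r_{it0} \mid x_i, z_i) = 0$$
$$e_{it} = (\bar{x}_i - \mu_{\bar{x}})\eta_1 + (\bar{z}_i - \mu_{\bar{z}})\eta_2 + v_{it}, \quad E(v_{it} \mid x_i, z_i) = 0,$$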


where $x_i$ and $z_i$ are the entire histories of the covariates and instruments. (Removing
the population means of $\bar{x}_i$ and $\bar{z}_i$ ensures $c_{it0}$ and $e_{it}$ have zero means.) Rewrite
equation (21.150) so that it includes the time averages.


d. Assume that $w_{it}$ can be expressed as

$$w_{it} = 1[\theta_0 + x_{it}\theta_1 + z_{it}\theta_2 + \bar{x}_i \theta_3 + \bar{z}_i \theta_4 + q_{it} \ge 0]$$
$$D(q_{it} \mid x_i, z_i) = \text{Normal}(0, 1).$$


Further, assume that E(cjo|qit,x;,Z;) = «oqu and E(ey|qi,Xi,Z;) = pqi. Find 
$E(y_{it} \mid w_{it}, x_i, z_i)$.

e. Use the expression from part d to propose a two-step control function approach to 
estimating the qz. 


21.15. Use the data in CATHETER.RAW to answer this question. The treatment 
variable is rhc, which is unity if a patient received a right-heart catheterization and 
zero if not. The response variable is death, equal to unity if the patient died within 
180 days. The patients were all admitted to the intensive care unit, and so the death 
rate is very high: almost 65% of the patients died in the first six months. See Li, 
Racine, and Wooldridge (2008). 


a. How many observations are in the sample? What fraction of patients received the 
RHC treatment? 
b. Use the difference in means to estimate the average treatment effect of rhc. Does
the treatment reduce the probability of death? Is the estimate practically large and
statistically significant?

c. Now use regression adjustment, estimating logit models for death separately for
rhc = 0 and rhc = 1. As covariates, use sex, race, income, cat1, cat2, ninsclas, and
age. For all but the age variable include dummy variables for all categories (except,
of course, a base category). What is the regression adjustment estimate of $\tau_{ate}$? Of
$\tau_{att}$? Obtain standard errors using, say, 1,000 bootstrap replications.

d. Estimate a logit model for rhc using the same covariates as in part c. Compare the
range and average of the estimated propensity scores for the treated and untreated
samples. Plot a histogram in each case, and comment on overlap.

e. Estimate $\tau_{ate}$ and $\tau_{att}$ by propensity score weighting. Again use 1,000 bootstrap
replications to obtain standard errors. How do the estimates compare with those
from part d? Overall, does it appear RHC reduces the probability of death?


21.16. Use the data in REGDISC.RAW to answer this question. These are simu- 
lated data of a fuzzy regression discontinuity design with forcing variable x. The 
discontinuity is at x = 5. 


a. What fraction of the observations have $x_i > 5$? What fraction of units are in the
treated group (that is, have $w_i = 1$)?

b. Estimate separate linear probability models for $x_i < 5$ and $x_i > 5$, and obtain the
fitted line. Do the same for logit models, and obtain the fitted logit functions. Graph
these functions on the same graph. What is the estimated jump in the probability of
treatment in each case?

c. Compute the estimate of $\tau_c$ with $c = 5$ in equation (21.107) using linear regression
for $y$ and $w$, and using all the data.

d. Now use the IV estimate described in equation (21.108), again using all the data.
Is it the same as the estimate in part c? What is its standard error?

e. Now use a “local” version of the IV method, restricting attention to data with
$x_i > 3$ and $x_i < 7$. What happens to the estimated $\tau_c$ and its standard error compared
with that in part d?


22 Duration Analysis 


22.1 Introduction 


Some response variables in economics come in the form of a duration, which is the 
time elapsed until a certain event occurs. A few examples include weeks unemployed, 
months spent on welfare, days until arrest after incarceration, and quarters until an 
Internet firm files for bankruptcy. 

The recent literature on duration analysis is quite rich. In this chapter we focus on 
the developments that have been used most often in applied work. In addition to 
providing a rigorous introduction to modern duration analysis, this chapter should 
prepare you for more advanced treatments, such as Lancaster’s (1990) monograph, 
van den Berg (2001), and Cameron and Trivedi (2005). 

Duration analysis has its origins in what is typically called survival analysis, where 
the duration of interest is survival time of a subject. In survival analysis we are 
interested in how various treatments or demographic characteristics affect survival 
times. In the social sciences, we are interested in any situation where an individual— 
or family, or firm, and so on—begins in an initial state and is either observed to exit 
the state or is censored. (We will discuss the exact nature of censoring in Sections 22.3 
and 22.4.) The calendar dates on which units enter the initial state do not have to 
be the same. (When we introduce covariates in Section 22.2.2, we note how dummy 
variables for different calendar dates can be included in the covariates, if necessary, 
to allow for systematic differences in durations by starting date.) 

Traditional duration analysis begins by specifying a population distribution for the 
duration, usually conditional on some explanatory variables (covariates) observed at 
the beginning of the duration. For example, for the population of people who became 
unemployed during a particular period, we might observe education levels, experi- 
ence, marital status—all measured when the person becomes unemployed—wage on 
prior job, and a measure of unemployment benefits. Then we specify a distribution 
for the unemployment duration conditional on the covariates. Any reasonable dis- 
tribution reflects the fact that an unemployment duration is nonnegative. Once a 
complete conditional distribution has been specified, the same maximum likelihood 
methods that we studied in Chapter 19 for censored regression models can be used. In 
this framework, we are typically interested in estimating the effects of the covariates 
on the expected duration. 

Recent treatments of duration analysis tend to focus on the hazard function. The 
hazard function allows us to approximate the probability of exiting the initial state 
within a short interval, conditional on having survived up to the starting time of the 
interval. In econometric applications, hazard functions are usually conditional on 
some covariates. An important feature for policy analysis is allowing the hazard 
function to depend on covariates that change over time. 




In Section 22.2 we define and discuss hazard functions, and we settle certain issues 
involved with introducing covariates into hazard functions. In Section 22.3 we show 
how censored regression models apply to standard duration models with single-cycle 
flow data, when all covariates are time constant. We also discuss the most common 
way of introducing unobserved heterogeneity into traditional duration analysis. 
Given parametric assumptions, we can test for duration dependence—which means 
that the probability of exiting the initial state depends on the length of time in the 
state—as well as for the presence of unobserved heterogeneity. 

In Section 22.4 we study methods that allow flexible estimation of a hazard func- 
tion, with both time-constant and time-varying covariates. We assume that we have 
grouped data; this term means that durations are observed to fall into fixed inter- 
vals (often weekly or monthly intervals) and that any time-varying covariates are 
assumed to be constant within an interval. We focus attention on the case with two 
states, with everyone in the population starting in the initial state, and single-cycle 
data, where each person either exits the initial state or is censored before exiting. 
We also show how heterogeneity can be included when the covariates are strictly 
exogenous. We touch on some additional issues in Section 22.5. 


22.2 Hazard Functions 


The hazard function plays a central role in modern duration analysis. In this section, 
we discuss various features of the hazard function, both with and without covariates, 
and provide some examples. 


22.2.1 Hazard Functions without Covariates 


Often in this chapter it is convenient to distinguish random variables from particular
outcomes of random variables. Let $T \ge 0$ denote the duration, which has some
distribution in the population; $t$ denotes a particular value of $T$. (As with any
econometric analysis, it is important to be very clear about the relevant population, a topic
we consider in Section 22.3.) In survival analysis, $T$ is the length of time a subject
lives. Much of the current terminology in duration analysis comes from survival
applications. For us, $T$ is the time at which a person (or family, firm, and so on)
leaves the initial state. For example, if the initial state is unemployment, $T$ would be
the time, measured in, say, weeks, until a person becomes employed.
The cumulative distribution function (cdf) of $T$ is defined as

$$F(t) = P(T \le t), \quad t \ge 0 \qquad (22.1)$$

The survivor function is defined as $S(t) \equiv 1 - F(t) = P(T > t)$, and this is the
probability of “surviving” past time $t$. We assume in the rest of this section that $T$ is
continuous (and, in fact, has a differentiable cdf) because this assumption simplifies
statements of certain probabilities. Discreteness in observed durations can be viewed
as a consequence of the sampling scheme, as we discuss in Section 22.4. Denote the
density of $T$ by $f(t) = \frac{dF}{dt}(t)$.
For $h > 0$,

$$P(t \le T < t + h \mid T \ge t) \qquad (22.2)$$

is the probability of leaving the initial state in the interval $[t, t + h)$ given survival up
until time $t$. The hazard function for $T$ is defined as

$$\lambda(t) = \lim_{h \downarrow 0} \frac{P(t \le T < t + h \mid T \ge t)}{h} \qquad (22.3)$$

For each $t$, $\lambda(t)$ is the instantaneous rate of leaving per unit of time. From equation
(22.3) it follows that, for “small” $h$,

$$P(t \le T < t + h \mid T \ge t) \approx \lambda(t) h \qquad (22.4)$$

Thus the hazard function can be used to approximate a conditional probability in
much the same way that the height of the density of $T$ can be used to approximate an
unconditional probability.


Example 22.1 (Unemployment Duration): If $T$ is length of time unemployed, measured
in weeks, then $\lambda(20)$ is (approximately) the probability of becoming employed
between weeks 20 and 21. The phrase “becoming employed” reflects the fact that the
person was unemployed up through week 20. That is, $\lambda(20)$ is roughly the probability
of becoming employed between weeks 20 and 21 conditional on having been unemployed
through week 20.


Example 22.2 (Recidivism Duration): Suppose $T$ is the number of months before a
former prisoner is arrested for a crime. Then $\lambda(12)$ is roughly the probability of being
arrested during the 13th month conditional on not having been arrested during the
first year.


We can express the hazard function in terms of the density and cdf very simply.
First, write

$$P(t \le T < t + h \mid T \ge t) = \frac{P(t \le T < t + h)}{P(T \ge t)} = \frac{F(t + h) - F(t)}{1 - F(t)}$$

When the cdf is differentiable, we can take the limit of the right-hand side, divided by
$h$, as $h$ approaches zero from above:

$$\lambda(t) = \lim_{h \downarrow 0} \frac{F(t + h) - F(t)}{h} \cdot \frac{1}{1 - F(t)} = \frac{f(t)}{1 - F(t)} = \frac{f(t)}{S(t)} \tag{22.5}$$
Because the derivative of $S(t)$ is $-f(t)$, we have

$$\lambda(t) = -\frac{d \log S(t)}{dt} \tag{22.6}$$

and, using $F(0) = 0$, we can integrate to get

$$F(t) = 1 - \exp\left[-\int_0^t \lambda(s)\, ds\right], \quad t \ge 0 \tag{22.7}$$

Straightforward differentiation of equation (22.7) gives the density of $T$ as

$$f(t) = \lambda(t) \exp\left[-\int_0^t \lambda(s)\, ds\right] \tag{22.8}$$


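Therefore, all probabilities can be computed using the hazard function. For example,
for points $a_1 < a_2$,

$$P(T \ge a_2 \mid T \ge a_1) = \frac{1 - F(a_2)}{1 - F(a_1)} = \exp\left[-\int_{a_1}^{a_2} \lambda(s)\, ds\right]$$

and

$$P(a_1 \le T < a_2 \mid T \ge a_1) = 1 - \exp\left[-\int_{a_1}^{a_2} \lambda(s)\, ds\right] \tag{22.9}$$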


This last expression is especially useful for constructing the log-likelihood functions 
needed in Section 22.4. 
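
As a rough numerical illustration of equation (22.9), the following sketch integrates a
hazard over an interval with the midpoint rule and converts the result into a conditional
exit probability. The function names, the midpoint rule, and the constant weekly hazard
of .05 are assumptions made only for this illustration.

```python
import math

def integrated_hazard(hazard, a1, a2, n=10_000):
    """Approximate the integral of hazard(s) over [a1, a2] with the midpoint rule."""
    width = (a2 - a1) / n
    return sum(hazard(a1 + (k + 0.5) * width) for k in range(n)) * width

def prob_exit_between(hazard, a1, a2):
    """P(a1 <= T < a2 | T >= a1), as in equation (22.9)."""
    return 1.0 - math.exp(-integrated_hazard(hazard, a1, a2))

def constant_hazard(s):
    return 0.05  # hypothetical weekly exit rate

# Probability of leaving between weeks 20 and 21, given survival to week 20.
p = prob_exit_between(constant_hazard, 20.0, 21.0)
```

With a one-week interval, the result is close to the hazard itself, which is the
approximation in equation (22.4).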

The shape of the hazard function is of primary interest in many empirical appli- 
cations. In the simplest case, the hazard function is constant: 


$$\lambda(t) = \lambda, \quad \text{all } t \ge 0 \tag{22.10}$$


This function means that the process driving T is memoryless: the probability of exit 
in the next interval does not depend on how much time has been spent in the initial 
state. From equation (22.7), a constant hazard implies 


$$F(t) = 1 - \exp(-\lambda t) \tag{22.11}$$

which is the cdf of the exponential distribution. Conversely, if $T$ has an exponential
distribution, it has a constant hazard.
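
The memoryless property can be checked directly from the survivor function implied by
equation (22.11). In this sketch the hazard value of .1 and the chosen time points are
hypothetical; the point is only that the probability of surviving a further four periods
is the same after 5 periods as after 50.

```python
import math

def survivor_exponential(t, lam):
    """S(t) = exp(-lam * t), implied by equation (22.11)."""
    return math.exp(-lam * t)

def prob_survive_extra(s, extra, lam):
    """P(T >= s + extra | T >= s) for an exponential duration."""
    return survivor_exponential(s + extra, lam) / survivor_exponential(s, lam)

lam = 0.1  # hypothetical constant hazard
after_5 = prob_survive_extra(5.0, 4.0, lam)
after_50 = prob_survive_extra(50.0, 4.0, lam)
```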

When the hazard function is not constant, we say that the process exhibits duration
dependence. Assuming that $\lambda(\cdot)$ is differentiable, there is positive duration dependence
at time $t$ if $d\lambda(t)/dt > 0$; if $d\lambda(t)/dt > 0$ for all $t > 0$, then the process exhibits
positive duration dependence. With positive duration dependence, the probability of
exiting the initial state increases the longer one is in the initial state. If the derivative
is negative, then there is negative duration dependence.


Example 22.3 (Weibull Distribution): If $T$ has a Weibull distribution, its cdf is given
by $F(t) = 1 - \exp(-\gamma t^{\alpha})$, where $\gamma$ and $\alpha$ are nonnegative parameters. The density is
$f(t) = \gamma\alpha t^{\alpha - 1}\exp(-\gamma t^{\alpha})$. By equation (22.5), the hazard function is

$$\lambda(t) = f(t)/S(t) = \gamma\alpha t^{\alpha - 1} \tag{22.12}$$

When $\alpha = 1$, the Weibull distribution reduces to the exponential with $\lambda = \gamma$. If $\alpha > 1$,
the hazard is monotonically increasing, so the hazard everywhere exhibits positive
duration dependence; for $\alpha < 1$, the hazard is monotonically decreasing. Provided we
think the hazard is monotonically increasing or decreasing, the Weibull distribution
is a relatively simple way to capture duration dependence.
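
A short numerical check of Example 22.3 may help fix ideas. The sketch below computes
the hazard as the ratio of the density to the survivor function and compares it with the
closed form in equation (22.12). The parameter values are hypothetical and chosen only
for illustration.

```python
import math

def weibull_cdf(t, gamma, alpha):
    return 1.0 - math.exp(-gamma * t ** alpha)

def weibull_density(t, gamma, alpha):
    return gamma * alpha * t ** (alpha - 1.0) * math.exp(-gamma * t ** alpha)

def weibull_hazard(t, gamma, alpha):
    """Closed form from equation (22.12)."""
    return gamma * alpha * t ** (alpha - 1.0)

gamma, alpha = 0.03, 0.8  # hypothetical parameter values
# Hazard computed as f(t) / S(t), following equation (22.5).
ratio = weibull_density(10.0, gamma, alpha) / (1.0 - weibull_cdf(10.0, gamma, alpha))
```

Evaluating `weibull_hazard` at increasing values of `t` shows a falling hazard when
`alpha` is below one and a rising hazard when it is above one.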


We often want to specify the hazard directly, in which case we can use equation 
(22.7) to determine the duration distribution. 


Example 22.4 (Log-Logistic Hazard Function): The log-logistic hazard function is
specified as

$$\lambda(t) = \frac{\gamma\alpha t^{\alpha - 1}}{1 + \gamma t^{\alpha}} \tag{22.13}$$

where $\gamma$ and $\alpha$ are positive parameters. When $\alpha = 1$, the hazard is monotonically
decreasing from $\gamma$ at $t = 0$ to zero as $t \to \infty$; when $\alpha < 1$, the hazard is also
monotonically decreasing to zero as $t \to \infty$, but the hazard is unbounded as $t$ approaches
zero. When $\alpha > 1$, the hazard is increasing until $t = [(\alpha - 1)/\gamma]^{1/\alpha}$, and then it
decreases to zero.

Straightforward integration gives

$$\int_0^t \lambda(s)\, ds = \log(1 + \gamma t^{\alpha}) = -\log[(1 + \gamma t^{\alpha})^{-1}]$$

so that, by equation (22.7),

$$F(t) = 1 - (1 + \gamma t^{\alpha})^{-1}, \quad t \ge 0 \tag{22.14}$$

Differentiating with respect to $t$ gives

$$f(t) = \gamma\alpha t^{\alpha - 1}(1 + \gamma t^{\alpha})^{-2}$$

Using this density, it can be shown that $Y \equiv \log(T)$ has density
$g(y) = \alpha\exp[\alpha(y - \mu)]/\{1 + \exp[\alpha(y - \mu)]\}^2$, where $\mu = -\alpha^{-1}\log(\gamma)$ is the mean of $Y$. In
other words, $\log(T)$ has a logistic distribution with mean $\mu$ and variance $\pi^2/(3\alpha^2)$
(hence the name “log-logistic”).
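
The turning point of the log-logistic hazard and the link between equations (22.7) and
(22.14) can both be checked numerically. This is a minimal sketch; the parameter values
are hypothetical, and the turning point is only meaningful when the shape parameter
exceeds one.

```python
import math

def loglogistic_hazard(t, gamma, alpha):
    """Equation (22.13)."""
    return gamma * alpha * t ** (alpha - 1.0) / (1.0 + gamma * t ** alpha)

def loglogistic_cdf(t, gamma, alpha):
    """Equation (22.14)."""
    return 1.0 - 1.0 / (1.0 + gamma * t ** alpha)

def hazard_peak(gamma, alpha):
    """Time at which the hazard stops rising; requires alpha > 1."""
    return ((alpha - 1.0) / gamma) ** (1.0 / alpha)

gamma, alpha = 0.02, 2.0  # hypothetical parameter values
t_peak = hazard_peak(gamma, alpha)
# cdf rebuilt from the integrated hazard log(1 + gamma * t^alpha), as in (22.7).
cdf_from_hazard = 1.0 - math.exp(-math.log(1.0 + gamma * 5.0 ** alpha))
```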


22.2.2 Hazard Functions Conditional on Time-Invariant Covariates 


Usually in economics we are interested in hazard functions conditional on a set of
covariates or regressors. When these do not change over time (as is often the case
given the way many duration data sets are collected), we simply define the
hazard (and all other features of $T$) conditional on the covariates. Thus the
conditional hazard is

$$\lambda(t; \mathbf{x}) = \lim_{h \downarrow 0} \frac{P(t \le T < t + h \mid T \ge t, \mathbf{x})}{h}$$

where $\mathbf{x}$ is a vector of explanatory variables. All the formulas from the previous
subsection continue to hold provided the cdf and density are defined conditional on
$\mathbf{x}$. For example, if the conditional cdf $F(\cdot \mid \mathbf{x})$ is differentiable, we have

$$\lambda(t; \mathbf{x}) = \frac{f(t \mid \mathbf{x})}{1 - F(t \mid \mathbf{x})} \tag{22.15}$$

where $f(\cdot \mid \mathbf{x})$ is the density of $T$ given $\mathbf{x}$. Often we are interested in the partial effects
of the $x_j$ on $\lambda(t; \mathbf{x})$, which are defined as partial derivatives for continuous $x_j$ and as
differences for discrete $x_j$.

If the durations start at different calendar dates—which is usually the case—we 
can include indicators for different starting dates in the covariates. These allow us to 
control for seasonal differences in duration distributions. 

An especially important class of models with time-invariant regressors consists of
proportional hazard models. A proportional hazard can be written as

$$\lambda(t; \mathbf{x}) = \kappa(\mathbf{x})\lambda_0(t) \tag{22.16}$$

where $\kappa(\cdot) > 0$ is a positive function of $\mathbf{x}$ and $\lambda_0(t) > 0$ is called the baseline hazard.
The baseline hazard is common to all units in the population; individual hazard
functions differ proportionately based on a function $\kappa(\mathbf{x})$ of observed covariates.

Typically, $\kappa(\cdot)$ is parameterized as $\kappa(\mathbf{x}) = \exp(\mathbf{x}\boldsymbol{\beta})$, where $\boldsymbol{\beta}$ is a vector of
parameters. Then

$$\log \lambda(t; \mathbf{x}) = \mathbf{x}\boldsymbol{\beta} + \log \lambda_0(t) \tag{22.17}$$

and $\beta_j$ measures the semielasticity of the hazard with respect to $x_j$. [If $x_j$ is the log of
an underlying variable, say $x_j = \log(z_j)$, $\beta_j$ is the elasticity of the hazard with respect
to $z_j$.]

Occasionally we are interested only in how the covariates shift the hazard function,
in which case estimation of $\lambda_0$ is not necessary. Cox (1972) obtained a partial maximum
likelihood estimator for $\boldsymbol{\beta}$ that does not require estimating $\lambda_0(\cdot)$. We discuss
Cox’s approach briefly in Section 22.5. In economics, much of the time we are interested
in the shape of the baseline hazard. We discuss estimation of proportional
hazard models with a flexible baseline hazard in Section 22.4.

If in the Weibull hazard function (22.12) we replace $\gamma$ with $\exp(\mathbf{x}\boldsymbol{\beta})$, where the first
element of $\mathbf{x}$ is unity, we obtain a proportional hazard model with $\lambda_0(t) \equiv \alpha t^{\alpha - 1}$.
However, if we replace $\gamma$ in equation (22.13) with $\exp(\mathbf{x}\boldsymbol{\beta})$ (which is the most common
way of introducing covariates into the log-logistic model) we do not obtain a
hazard with the proportional hazard form.
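
One way to see the difference is to compare hazards for two covariate values at several
durations. Under a proportional hazard the ratio does not depend on time; for the
log-logistic hazard it does. The index values standing in for the linear index and the
shape parameter below are hypothetical, as are the function names.

```python
import math

def weibull_hazard(t, xb, alpha):
    """Weibull hazard (22.12) with gamma replaced by exp(xb)."""
    return math.exp(xb) * alpha * t ** (alpha - 1.0)

def loglogistic_hazard(t, xb, alpha):
    """Log-logistic hazard (22.13) with gamma replaced by exp(xb)."""
    g = math.exp(xb)
    return g * alpha * t ** (alpha - 1.0) / (1.0 + g * t ** alpha)

def hazard_ratio(hazard, t, xb_1, xb_0, alpha):
    return hazard(t, xb_1, alpha) / hazard(t, xb_0, alpha)

alpha, xb_0, xb_1 = 1.5, -3.0, -2.5  # hypothetical values of the index and shape
times = (1.0, 10.0, 50.0)
weibull_ratios = [hazard_ratio(weibull_hazard, t, xb_1, xb_0, alpha) for t in times]
loglogistic_ratios = [hazard_ratio(loglogistic_hazard, t, xb_1, xb_0, alpha) for t in times]
```

The Weibull ratios all equal the exponential of the difference in the index, while the
log-logistic ratios shrink toward one as the duration grows.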


Example 22.1 (continued): If T is an unemployment duration, x might contain 
education, labor market experience, marital status, race, and number of children, all 
measured at the beginning of the unemployment spell. Policy variables in x might 
reflect the rules governing unemployment benefits, where these are known before 
each person’s unemployment duration. 


Example 22.2 (continued): To explain the length of time before arrest after release 
from prison, the covariates might include participation in a work program while in 
prison, years of education, marital status, race, time served, and past number of 
convictions. 


22.2.3 Hazard Functions Conditional on Time-Varying Covariates 


Studying hazard functions is more complicated when we wish to model the effects of 
time-varying covariates on the hazard function. For one thing, it makes no sense to 
specify the distribution of the duration T conditional on the covariates at only one 
time period. Nevertheless, we can still define the appropriate conditional probabilities 
that lead to a conditional hazard function. 

Let $\mathbf{x}(t)$ denote the vector of regressors at time $t$; again, this is the random vector
describing the population. For $t \ge 0$, let $\mathbf{X}(t)$ denote the covariate path up
through time $t$: $\mathbf{X}(t) \equiv \{\mathbf{x}(s): 0 \le s \le t\}$. Following Lancaster (1990, Chapter 2), we
define the conditional hazard function at time $t$ by

$$\lambda[t; \mathbf{X}(t)] = \lim_{h \downarrow 0} \frac{P[t \le T < t + h \mid T \ge t, \mathbf{X}(t + h)]}{h} \tag{22.18}$$


assuming that this limit exists. A discussion of assumptions that ensure existence of
equation (22.18) is well beyond the scope of this book; see Lancaster (1990, Chapter
2). One case where this limit exists very generally occurs when $T$ is continuous and,
for each $t$, $\mathbf{x}(t + h)$ is constant for all $h \in [0, \eta(t)]$ for some function $\eta(t) > 0$. Then we
can replace $\mathbf{X}(t + h)$ with $\mathbf{X}(t)$ in equation (22.18) [because $\mathbf{X}(t + h) = \mathbf{X}(t)$ for $h$
sufficiently small]. For reasons we will see in Section 22.4, we must assume that
time-varying covariates are constant over the interval of observation (such as a week or a
month), anyway, in which case there is no problem in defining equation (22.18).

For certain purposes, it is important to know whether time-varying covariates are
strictly exogenous. With the hazard defined as in equation (22.18), Lancaster (1990,
Definition 2.1) provides a definition that rules out feedback from the duration to
future values of the covariates. Specifically, if $\mathbf{X}(t, t + h)$ denotes the covariate path
from time $t$ to $t + h$, then Lancaster’s strict exogeneity condition is

$$P[\mathbf{X}(t, t + h) \mid T \ge t + h, \mathbf{X}(t)] = P[\mathbf{X}(t, t + h) \mid \mathbf{X}(t)] \tag{22.19}$$

for all $t \ge 0$, $h > 0$. Actually, when condition (22.19) holds, Lancaster says
$\{\mathbf{x}(t): t \ge 0\}$ is “exogenous.” We prefer the name “strictly exogenous” because condition
(22.19) is closely related to the notions of strict exogeneity that we have encountered
throughout this book. Plus, it is important to see that condition (22.19) has
nothing to do with contemporaneous endogeneity: by definition, the covariates are
sequentially exogenous (see Section 7.4) because, by specifying $\lambda[t; \mathbf{X}(t)]$, we are
conditioning on current and past covariates.

Equation (22.19) applies to covariates whose entire path is well defined whether or 
not the agent is in the initial state. One such class of covariates, called external 
covariates by Kalbfleisch and Prentice (1980), has the feature that the covariate path 
is independent of whether any particular agent has or has not left the initial state. In 
modeling time until arrest, these covariates might include law enforcement per capita 
in the person’s city of residence or the city unemployment rate. 

Other covariates are not external to each agent but have paths that are still defined 
after the agent leaves the initial state. For example, marital status is well defined be- 
fore and after someone is arrested, but it is possibly related to whether someone has 
been arrested. Whether marital status satisfies condition (22.19) is an empirical issue.

The definition of strict exogeneity in condition (22.19) cannot be applied to
time-varying covariates whose path is not defined once the agent leaves the initial state.
Kalbfleisch and Prentice (1980) call these internal covariates. Lancaster (1990, p. 28) 
gives the example of job tenure duration, where a time-varying covariate is wage paid 
on the job: if a person leaves the job, it makes no sense to define the future wage path 
in that job. As a second example, in modeling the time until a former prisoner is
arrested, a time-varying covariate at time $t$ might be wage income in the previous
month, $t - 1$. If someone is arrested and reincarcerated, it makes little sense to define
future labor income.

It is pretty clear that internal covariates cannot satisfy any reasonable strict exo- 
geneity assumption. This fact will be important in Section 22.4 when we discuss esti- 
mation of duration models with unobserved heterogeneity and grouped duration 
data. We will actually use a slightly different notion of strict exogeneity that is directly 
relevant for conditional maximum likelihood estimation. Nevertheless, it is in the 
same spirit as condition (22.19). 

With time-varying covariates there is not, strictly speaking, such a thing as a pro- 
portional hazard model. Nevertheless, it has become common in econometrics to call 
a hazard of the form 


$$\lambda[t; \mathbf{x}(t)] = \kappa[\mathbf{x}(t)]\lambda_0(t) \tag{22.20}$$

a proportional hazard with time-varying covariates. The function multiplying the
baseline hazard is usually $\kappa[\mathbf{x}(t)] = \exp[\mathbf{x}(t)\boldsymbol{\beta}]$; for notational reasons, we show this
depending only on $\mathbf{x}(t)$ and not on past covariates [which can always be included in
$\mathbf{x}(t)$]. We will discuss estimation of these models, without the strict exogeneity
assumption, in Section 22.4.2. In Section 22.4.3, when we multiply equation (22.20) by
unobserved heterogeneity, strict exogeneity becomes very important.

The log-logistic hazard is also easily modified to have time-varying covariates. One 
way to include time-varying covariates parametrically is 


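$$\lambda[t; \mathbf{x}(t)] = \exp[\mathbf{x}(t)\boldsymbol{\beta}]\alpha t^{\alpha - 1}/\{1 + \exp[\mathbf{x}(t)\boldsymbol{\beta}]t^{\alpha}\}$$


We will see how to estimate $\alpha$ and $\boldsymbol{\beta}$ in Section 22.4.2.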


22.3 Analysis of Single-Spell Data with Time-Invariant Covariates 


We assume that the population of interest is individuals entering the initial state
during a given interval of time, say $[0, b]$, where $b > 0$ is a known constant. (Naturally,
“individual” can be replaced with any population unit of interest, such as “family”
or “firm.”) As in all econometric contexts, it is very important to be explicit about the
underlying population. By convention, we let zero denote the earliest calendar date
that an individual can enter the initial state, and $b$ is the last possible date. For
example, if we are interested in the population of U.S. workers who became unemployed
at any time during 1998, and unemployment duration is measured in years
(with .5 meaning half a year), then $b = 1$. If duration is measured in weeks, then
$b = 52$; if duration is measured in days, then $b = 365$; and so on.

In using the methods of this section, we typically ignore the fact that durations are 
often grouped into discrete intervals—for example, measured to the nearest week or 
month—and treat them as continuously distributed. If we want to explicitly recog- 
nize the discreteness of the measured durations, we should treat them as grouped 
data, as we do in Section 22.4. 

We restrict attention to single-spell data. That is, we use, at most, one completed 
spell per individual. If, after leaving the initial state, an individual subsequently 
reenters the initial state in the interval $[0, b]$, we ignore this information. In addition,
the covariates in the analysis are time invariant, which means we collect covariates on 
individuals at a given point in time—usually, at the beginning of the spell—and we 
do not re-collect data on the covariates during the course of the spell. Time-varying 
covariates are more naturally handled in the context of grouped duration data in 
Section 22.4. 

We study two general types of sampling from the population that we have de- 
scribed. The most common, and the easiest to handle, is flow sampling. In Section 
22.3.3 we briefly consider various kinds of stock sampling. 


22.3.1 Flow Sampling 


With flow sampling, we sample individuals who enter the state at some point during 
the interval [0,b], and we record the length of time each individual is in the initial 
state. We collect data on covariates known at the time the individual entered the initial 
state. For example, suppose we are interested in the population of U.S. workers who 
became unemployed at any time during 1998, and we randomly sample from U.S. 
male workers who became unemployed during 1998. At the beginning of the unem- 
ployment spell we might obtain information on tenure in last job, wage on last job, 
gender, marital status, and information on unemployment benefits. 

There are two common ways to collect flow data on unemployment spells. First, 
we may randomly sample individuals from a large population, say, all working-age 
individuals in the United States for a given year, say, 1998. Some fraction of these 
people will be in the labor force and will become unemployed during 1998 (that is,
enter the initial state of unemployment during the specified interval), and this group
of people who become unemployed is our random sample of all workers who become 
unemployed during 1998. Another possibility is retrospective sampling. For example, 
suppose that, for a given state in the United States, we have access to unemployment 
records for 1998. We can then obtain a random sample of all workers who became 
unemployed during 1998. 

Flow data are usually subject to right censoring. That is, after a certain amount of 
time, we stop following the individuals in the sample, which we must do in order to 
analyze the data. (Right censoring is the only kind that occurs with flow data, so we 
will often refer to right censoring as “censoring” in this and the next subsection.) For 
individuals who have completed their spells in the initial state, we observe the exact 
duration. But for those still in the initial state, we only know that the duration lasted 
as long as the tracking period. In the unemployment duration example, we might 
follow each individual for a fixed length of time, say, two years. If unemployment 
spells are measured in weeks, we would have right censoring at 104 weeks. Alter- 
natively, we might stop tracking individuals at a fixed calendar date, say, the last 
week in 1999. Because individuals can become unemployed at any time during 1998, 
calendar-date censoring results in censoring times that differ across individuals. 
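
The two tracking rules can be made concrete with a small sketch. Here time is measured
in weeks from the start of 1998, tracking stops at week 104, and the observed duration is
the smaller of the true duration and the censoring time. The function names, the cutoff,
and the three spells are hypothetical values chosen for illustration.

```python
def censoring_time(start, calendar_cutoff=104.0):
    """Censoring time when tracking stops at a fixed calendar date."""
    return calendar_cutoff - start

def observe(true_duration, censor_time):
    """Return the observed duration and an indicator equal to 1 if uncensored."""
    t_obs = min(true_duration, censor_time)
    d = 1 if true_duration <= censor_time else 0
    return t_obs, d

# Hypothetical spells: (starting week, true duration in weeks).
spells = [(0.0, 30.0), (26.0, 90.0), (52.0, 60.0)]
observed = [observe(t_star, censoring_time(a)) for a, t_star in spells]
```

The second and third spells are censored at different lengths (78 and 52 weeks) because
they started on different dates.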


22.3.2 Maximum Likelihood Estimation with Censored Flow Data 


For a random draw $i$ from the population, let $a_i \in [0, b]$ denote the time at which
individual $i$ enters the initial state (the “starting time”), let $t_i^*$ denote the length of time
in the initial state (the duration), and let $\mathbf{x}_i$ denote the vector of observed covariates.
We assume that $t_i^*$ has a continuous conditional density $f(t \mid \mathbf{x}_i; \boldsymbol{\theta})$, $t \ge 0$, where $\boldsymbol{\theta}$ is
the vector of unknown parameters.

Without right censoring we would observe a random sample on $(a_i, t_i^*, \mathbf{x}_i)$, and
estimation would be a standard exercise in conditional maximum likelihood. To
account for right censoring, we assume that the observed duration, $t_i$, is obtained as

$$t_i = \min(t_i^*, c_i) \tag{22.21}$$


where $c_i$ is the censoring time for individual $i$. In some cases, $c_i$ is constant across $i$.
For example, suppose $t_i^*$ is unemployment duration for person $i$, measured in weeks.
If the sample design specifies that we follow each person for at most two years, at
which point all people remaining unemployed after two years are censored, then $c = 104$.
If we have a fixed calendar date at which we stop tracking individuals, the censoring
time differs by individual because the workers typically would become unemployed
on different calendar dates. If $b = 52$ weeks and we censor everyone at two
years from the start of the study, the censoring times could range from 52 to 104 weeks.

We assume that, conditional on the covariates, the true duration is independent of
the starting point, $a_i$, and the censoring time, $c_i$:

$$D(t_i^* \mid \mathbf{x}_i, a_i, c_i) = D(t_i^* \mid \mathbf{x}_i) \tag{22.22}$$


where $D(\cdot \mid \cdot)$ denotes conditional distribution. Assumption (22.22) clearly holds
when $a_i$ and $c_i$ are constant for all $i$, but it holds under much weaker assumptions.
Sometimes $c_i$ is constant for all $i$, in which case assumption (22.22) holds when the
duration is independent of the starting time, conditional on $\mathbf{x}_i$. If there are seasonal
effects on the duration (for example, unemployment durations that start in
the summer have a different expected length than durations that start at other times
of the year), then we may have to put dummy variables for different starting dates
in $\mathbf{x}_i$ to ensure that assumption (22.22) holds. This approach would also ensure
that assumption (22.22) holds when a fixed calendar date is used for censoring,
implying that $c_i$ is not constant across $i$. Assumption (22.22) holds for certain
nonstandard censoring schemes, too. For example, if an element of $\mathbf{x}_i$ is education,
assumption (22.22) holds if, say, individuals with more education are censored more
quickly.

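Under assumption (22.22), the distribution of $t_i^*$ given $(\mathbf{x}_i, a_i, c_i)$ does not depend
on $(a_i, c_i)$. Therefore, if the duration is not censored, the density of $t_i = t_i^*$ given
$(\mathbf{x}_i, a_i, c_i)$ is simply $f(t \mid \mathbf{x}_i; \boldsymbol{\theta})$. The probability that $t_i$ is censored is

$$P(t_i^* \ge c_i \mid \mathbf{x}_i) = 1 - F(c_i \mid \mathbf{x}_i; \boldsymbol{\theta})$$

where $F(t \mid \mathbf{x}_i; \boldsymbol{\theta})$ is the conditional cdf of $t_i^*$ given $\mathbf{x}_i$. Letting $d_i$ be a censoring
indicator ($d_i = 1$ if uncensored, $d_i = 0$ if censored), the conditional likelihood for
observation $i$ can be written as

$$f(t_i \mid \mathbf{x}_i; \boldsymbol{\theta})^{d_i}[1 - F(t_i \mid \mathbf{x}_i; \boldsymbol{\theta})]^{(1 - d_i)} \tag{22.23}$$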


Importantly, neither the starting times, $a_i$, nor the length of the interval, $b$, plays a
role in the analysis. (In fact, in the vast majority of treatments of flow data, $b$ and $a_i$
are not even introduced. However, it is important to know that the reason $a_i$ is not
relevant for the analysis of flow data is the conditional independence assumption in
equation (22.22).) By contrast, the censoring times $c_i$ do appear in the likelihood for
censored observations because then $t_i = c_i$. Given data on $(t_i, d_i, \mathbf{x}_i)$ for a random
sample of size $N$, the maximum likelihood estimator of $\boldsymbol{\theta}$ is obtained by maximizing

$$\sum_{i=1}^{N} \{d_i \log[f(t_i \mid \mathbf{x}_i; \boldsymbol{\theta})] + (1 - d_i)\log[1 - F(t_i \mid \mathbf{x}_i; \boldsymbol{\theta})]\} \tag{22.24}$$

For the choices of $f(\cdot \mid \mathbf{x}; \boldsymbol{\theta})$ used in practice, the conditional MLE regularity
conditions (see Chapter 13) hold, and the MLE is $\sqrt{N}$-consistent and asymptotically
normal. (If there is no censoring, $d_i = 1$ for all $i$ and the second term in
expression (22.24) is simply dropped.)
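
To see expression (22.24) at work in the simplest possible case, consider an exponential
duration with no covariates, which is an assumption made here only for illustration. The
log likelihood then has a closed-form maximizer: the number of completed spells divided
by the total observed time. The sketch below simulates censored data and evaluates both;
the function names, seed, true hazard, and censoring time are all hypothetical.

```python
import math
import random

def censored_exponential_loglik(lam, durations, uncensored):
    """Expression (22.24) with f(t) = lam*exp(-lam*t) and 1 - F(t) = exp(-lam*t)."""
    total = 0.0
    for t, d in zip(durations, uncensored):
        log_density = math.log(lam) - lam * t
        log_survivor = -lam * t
        total += d * log_density + (1 - d) * log_survivor
    return total

def censored_exponential_mle(durations, uncensored):
    """Closed-form maximizer: completed spells divided by total observed time."""
    return sum(uncensored) / sum(durations)

rng = random.Random(0)
true_lam, censor_time = 0.05, 30.0  # hypothetical values
latent = [rng.expovariate(true_lam) for _ in range(2000)]
durations = [min(t, censor_time) for t in latent]          # observed durations
uncensored = [1 if t <= censor_time else 0 for t in latent]
lam_hat = censored_exponential_mle(durations, uncensored)
```

Dropping the censored observations, or treating them as completed spells, would change
the numerator or the interpretation of the denominator and generally bias the estimate.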

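Because the hazard function can be expressed as in equation (22.15), specifying $f$
means that the hazard function can be estimated once we have the MLE, $\hat{\boldsymbol{\theta}}$. For
example, the Weibull distribution with covariates has conditional density

$$f(t \mid \mathbf{x}_i; \boldsymbol{\theta}) = \exp(\mathbf{x}_i\boldsymbol{\beta})\alpha t^{\alpha - 1}\exp[-\exp(\mathbf{x}_i\boldsymbol{\beta})t^{\alpha}] \tag{22.25}$$

where $\mathbf{x}_i$ contains unity as its first element for all $i$. (We obtain this density from
Example 22.3 with $\gamma$ replaced by $\exp(\mathbf{x}_i\boldsymbol{\beta})$.) The hazard function in this case is simply
$\lambda(t; \mathbf{x}) = \exp(\mathbf{x}\boldsymbol{\beta})\alpha t^{\alpha - 1}$.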
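
Combining the Weibull density (22.25) with expression (22.24) gives a log likelihood that
is simple to code: the log density is the log hazard minus the cumulative hazard, and the
log survivor function is minus the cumulative hazard. The following is a sketch under
those definitions, not a full estimation routine; the function name and the tiny data set
are hypothetical, and a numerical optimizer would still be needed to obtain estimates.

```python
import math

def weibull_censored_loglik(beta, alpha, durations, uncensored, covariates):
    """Log likelihood (22.24) for the Weibull density (22.25).

    beta: list of coefficients; covariates: list of rows x_i (first entry 1.0).
    """
    total = 0.0
    for t, d, x in zip(durations, uncensored, covariates):
        xb = sum(b * xj for b, xj in zip(beta, x))
        cumulative_hazard = math.exp(xb) * t ** alpha  # equals -log[1 - F(t | x)]
        log_hazard = xb + math.log(alpha) + (alpha - 1.0) * math.log(t)
        # log f = log hazard - cumulative hazard; log(1 - F) = -cumulative hazard
        total += d * log_hazard - cumulative_hazard
    return total

# Hypothetical data: three spells, the last one right censored.
durations = [3.0, 12.0, 20.0]
uncensored = [1, 1, 0]
covariates = [[1.0, 0.0], [1.0, 1.0], [1.0, 0.0]]
value = weibull_censored_loglik([math.log(0.05), 0.0], 1.0, durations, uncensored, covariates)
```

With the shape parameter set to one and a zero slope, the function reduces to the
censored exponential log likelihood with hazard .05.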


Example 22.5 (Weibull Model for Recidivism Duration): Let durat be the length 
of time, in months, until an inmate is arrested after being released from prison. 
Although the duration is rounded to the nearest month, we treat durat as a continu- 
ous variable with a Weibull distribution. We are interested in how certain covariates 
affect the hazard function for recidivism, and also whether there is positive or nega- 
tive duration dependence, once we have conditioned on the covariates. The variable 
workprg—a binary indicator for participation in a prison work program—is of par- 
ticular interest. 

The data in RECID.RAW, which come from Chung, Schmidt, and Witte (1991),
are flow data because they are a random sample of convicts released from prison during
the period July 1, 1977, through June 30, 1978. The data are retrospective in that they
were obtained by looking at records in April 1984, which served as the common
censoring date. Because of the different starting times, the censoring times, $c_i$, vary
from 70 to 81 months. The results of the Weibull estimation are in Table 22.1.

In interpreting the estimates, we use equation (22.17). For small $\hat{\beta}_j$, we can multiply
the coefficient by 100 to obtain the semielasticity of the hazard with respect to $x_j$.
(No covariates appear in logarithmic form, so there are no elasticities among the $\hat{\beta}_j$.)
For example, if tserved increases by one month, the hazard shifts up by about 1.4
percent, and the effect is statistically significant. Another year of education reduces
the hazard by about 2.3 percent, but the effect is insignificant at even the 10 percent
level against a two-sided alternative.

The sign of the workprg coefficient is unexpected, at least if we expect the work
program to have positive benefits after the inmates are released from prison. (The
result is not statistically different from zero.) The reason could be that the program is
ineffective or that there is self-selection into the program.

Table 22.1
Weibull Estimation of Criminal Recidivism

Explanatory Variable     Coefficient     (Standard Error)
workprg                     .091            (.091)
priors                      .089            (.013)
tserved                     .014            (.002)
felon                      -.299            (.106)
alcohol                     .447            (.106)
drugs                       .281            (.098)
black                       .454            (.088)
married                    -.152            (.109)
educ                       -.023            (.019)
age                        -.0037           (.0005)
constant                  -3.402            (0.301)
α                           .806            (.031)

Observations               1,445
Log likelihood            -1,633.03

For large $\hat{\beta}_j$ we should exponentiate and subtract unity to obtain the proportionate
change. For example, at any point in time, the hazard is about $100[\exp(.447) - 1]$
= 56.4 percent greater for someone with an alcohol problem than for someone
without.
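
The two interpretations can be compared with a few lines of arithmetic. The coefficients
below are copied from Table 22.1; the function name is a hypothetical label for the
exponentiate-and-subtract-unity calculation described above.

```python
import math

def percent_change_from_coefficient(beta_hat):
    """Exact proportionate shift in the hazard, in percent, implied by (22.17)."""
    return 100.0 * (math.exp(beta_hat) - 1.0)

# Coefficients copied from Table 22.1.
alcohol, tserved = 0.447, 0.014
exact_alcohol = percent_change_from_coefficient(alcohol)   # large coefficient
approx_alcohol = 100.0 * alcohol                           # 100 times the coefficient
exact_tserved = percent_change_from_coefficient(tserved)   # small coefficient
```

For tserved the two calculations agree to the first decimal place, whereas for alcohol
multiplying the coefficient by 100 understates the exact change by more than ten
percentage points.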

The estimate of $\alpha$ is .806, and the standard error of $\hat{\alpha}$ leads to a strong rejection of
$H_0$: $\alpha = 1$ against $H_1$: $\alpha < 1$ (the $t$ statistic is $(.806 - 1)/.031 \approx -6.26$). Therefore,
there is evidence of negative duration dependence, conditional on the covariates. This
means that, for a particular ex-convict,
the instantaneous rate of being arrested decreases with the length of time out of
prison. Figure 22.1 provides a graph of the estimated Weibull hazard.

Figure 22.1
Weibull hazard of recidivism

When the Weibull model is estimated without the covariates, $\hat{\alpha} = .770$ (se = .031),
which shows slightly more negative duration dependence. This is a typical finding in
applications of Weibull duration models: the estimated $\alpha$ without covariates tends to be
less than the estimate with covariates. Lancaster (1990, Section 10.2) contains a
theoretical discussion based on unobserved heterogeneity.


When we are primarily interested in the effects of covariates on the expected
duration (rather than on the hazard), we can apply a censored Tobit analysis to the
log of the duration. A Tobit analysis assumes that, for each random draw $i$, $\log(t_i^*)$
given $\mathbf{x}_i$ has a Normal$(\mathbf{x}_i\boldsymbol{\delta}, \sigma^2)$ distribution, which implies that $t_i^*$ given $\mathbf{x}_i$ has a
lognormal distribution. (The first element of $\mathbf{x}_i$ is unity.) The hazard function for a
lognormal distribution, conditional on $\mathbf{x}$, is $\lambda(t; \mathbf{x}) = h[(\log t - \mathbf{x}\boldsymbol{\delta})/\sigma]/(\sigma t)$, where
$h(z) \equiv \phi(z)/[1 - \Phi(z)]$, $\phi(\cdot)$ is the standard normal probability density function (pdf), and
$\Phi(\cdot)$ is the standard normal cdf. The lognormal hazard function is not monotonic
and does not have the proportional hazard form. Nevertheless, the estimates of the $\delta_j$
are easy to interpret because the model is equivalent to

$$\log(t_i^*) = \mathbf{x}_i\boldsymbol{\delta} + e_i \tag{22.26}$$

where $e_i$ is independent of $\mathbf{x}_i$ and normally distributed. Therefore, the $\delta_j$ are
semielasticities (or elasticities if the covariates are in logarithmic form) of the
covariates on the expected duration.
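
The nonmonotonic shape of the lognormal hazard is easy to verify numerically from the
formula given before equation (22.26). This sketch uses only the standard library; the
value of the linear index, the scale parameter, and the evaluation points are hypothetical.

```python
import math

def std_normal_pdf(z):
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)

def std_normal_sf(z):
    """One minus the standard normal cdf, computed with erfc for tail accuracy."""
    return 0.5 * math.erfc(z / math.sqrt(2.0))

def lognormal_hazard(t, xd, sigma):
    """Hazard of a lognormal duration with log-mean xd and log-scale sigma."""
    z = (math.log(t) - xd) / sigma
    return std_normal_pdf(z) / std_normal_sf(z) / (sigma * t)

xd, sigma = 3.0, 1.0  # hypothetical index and scale
hazards = [lognormal_hazard(t, xd, sigma) for t in (1.0, 10.0, 100.0)]
```

With these values the hazard rises between the first two points and falls between the
last two, so it cannot be summarized as positive or negative duration dependence.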

The Weibull model can also be represented in regression form. When $t_i^*$ given $\mathbf{x}_i$
has density (22.25), $\exp(\mathbf{x}_i\boldsymbol{\beta})(t_i^*)^{\alpha}$ is independent of $\mathbf{x}_i$ and has a unit exponential
distribution. Its natural log therefore has a type I extreme value distribution, so
we can write $\alpha\log(t_i^*) = -\mathbf{x}_i\boldsymbol{\beta} + u_i$, where $u_i$ is independent of $\mathbf{x}_i$ and has density
$g(u) = \exp(u)\exp\{-\exp(u)\}$. The mean of $u_i$ is not zero, but, because $u_i$ is independent
of $\mathbf{x}_i$, we can write $\log(t_i^*)$ exactly as in equation (22.26), where the slope
coefficients are given by $\delta_j = -\beta_j/\alpha$, and the intercept is more complicated. Now, $e_i$ does
not have a normal distribution, but it is independent of $\mathbf{x}_i$ with zero mean. Censoring
can be handled by maximum likelihood estimation. The estimated coefficients can be
compared with the censored Tobit estimates described previously to see if the
estimates are sensitive to the distributional assumption.

In Example 22.5, we can obtain the Weibull estimates of the $\delta_j$ as $\hat{\delta}_j = -\hat{\beta}_j/\hat{\alpha}$. (Some
econometrics packages, such as Stata, allow direct estimation of the $\delta_j$ and provide
standard errors.) For example, $\hat{\delta}_{drugs} = -.281/.806 \approx -.349$. When the lognormal
model is used, the coefficient on drugs is somewhat smaller in magnitude, about
$-.298$. As another example, $\hat{\delta}_{age} = .0046$ in the Weibull estimation and $\hat{\delta}_{age} = .0039$
in the lognormal estimation. In both cases, the estimates have $t$ statistics greater than
six. For obtaining estimates on the expected duration, the Weibull and lognormal
models give similar results. The lognormal model fits the data notably better, with log
likelihood = $-1{,}597.06$. This result is consistent with the findings of Chung, Schmidt,
and Witte (1991).
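
The conversion from hazard coefficients to log-duration coefficients is a one-line
calculation. The sketch below applies it to three coefficients copied from Table 22.1; the
dictionary layout and function name are hypothetical, and standard errors (which would
require, for example, the delta method) are not computed here.

```python
# Weibull hazard coefficients and the shape estimate copied from Table 22.1.
beta_hat = {"drugs": 0.281, "age": -0.0037, "workprg": 0.091}
alpha_hat = 0.806

def duration_coefficients(beta_hat, alpha_hat):
    """Map hazard coefficients to log-duration coefficients: delta_j = -beta_j / alpha."""
    return {name: -b / alpha_hat for name, b in beta_hat.items()}

delta_hat = duration_coefficients(beta_hat, alpha_hat)
```

The sign flips because a covariate that raises the hazard shortens the expected duration.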

Importantly, the shapes of the lognormal and Weibull hazards are quite different.
The lognormal hazard, which is plotted in Figure 22.2 with the covariates set at their
mean values, first increases until about six months and then decreases thereafter. This
shape implies that for a short period (roughly six months) the instantaneous
probability of being arrested for a new crime increases the longer an ex-convict has been
released from prison. After that, the hazard falls to about .001 at 80 months. Therefore,
conditional on an ex-convict being out of prison 6 2/3 years, the probability of
being arrested for a new crime in the 81st month is roughly .001. The Weibull hazard
implies that there is initially a greater than .01 instantaneous probability of being
arrested. At 80 months, the probability of being arrested during the 81st month is
about .005, or five times larger than that obtained from the lognormal model.

Figure 22.2
Lognormal hazard of recidivism

Sometimes we begin by specifying a parametric model for the hazard conditional
on $\mathbf{x}$ and then use the formulas from Section 22.2 to obtain the cdf and density. This
approach is easiest when the hazard leads to a tractable duration distribution, but
there is no reason the hazard function must be of the proportional hazard form.


Example 22.6 (Log-Logistic Hazard with Covariates): A log-logistic hazard function
with covariates is

$$\lambda(t; \mathbf{x}) = \exp(\mathbf{x}\boldsymbol{\beta})\alpha t^{\alpha - 1}/[1 + \exp(\mathbf{x}\boldsymbol{\beta})t^{\alpha}] \tag{22.27}$$

where $x_1 \equiv 1$. From equation (22.14) with $\gamma = \exp(\mathbf{x}\boldsymbol{\beta})$, the cdf is

$$F(t \mid \mathbf{x}; \boldsymbol{\theta}) = 1 - [1 + \exp(\mathbf{x}\boldsymbol{\beta})t^{\alpha}]^{-1}, \quad t \ge 0 \tag{22.28}$$

The distribution of $\log(t_i^*)$ given $\mathbf{x}_i$ is logistic with mean $-\alpha^{-1}\log\{\exp(\mathbf{x}_i\boldsymbol{\beta})\} =
-\alpha^{-1}\mathbf{x}_i\boldsymbol{\beta}$ and variance $\pi^2/(3\alpha^2)$. Therefore, $\log(t_i^*)$ can be written as in equation
(22.26) where $e_i$ has a zero mean logistic distribution and is independent of $\mathbf{x}_i$ and
$\boldsymbol{\delta} = -\alpha^{-1}\boldsymbol{\beta}$. This is another example where the effects of the covariates on the mean
duration can be obtained by an OLS regression when there is no censoring. With
censoring, the distribution of $e_i$ must be accounted for using the log likelihood in
expression (22.24).
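
A small simulation can illustrate the regression representation in Example 22.6 when
there is no censoring. The sketch draws durations by inverting the cdf (22.28) and
regresses the log duration on a single covariate; the slope should be near minus the
hazard coefficient divided by the shape parameter. The parameter values, the uniform
covariate, the seed, and the helper names are all hypothetical.

```python
import math
import random

def draw_loglogistic(rng, xb, alpha):
    """Inverse-cdf draw from (22.28): solve F(t | x) = u for t."""
    u = rng.random()
    while u == 0.0:  # rng.random() lies in [0, 1); avoid a zero duration
        u = rng.random()
    return ((u / (1.0 - u)) / math.exp(xb)) ** (1.0 / alpha)

def ols_slope(x, y):
    """Slope from a simple regression of y on x and a constant."""
    x_bar, y_bar = sum(x) / len(x), sum(y) / len(y)
    num = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    den = sum((xi - x_bar) ** 2 for xi in x)
    return num / den

rng = random.Random(1)
alpha, beta0, beta1 = 2.0, -4.0, 1.0  # hypothetical parameter values
x = [rng.random() for _ in range(5000)]
log_t = [math.log(draw_loglogistic(rng, beta0 + beta1 * xi, alpha)) for xi in x]
slope = ols_slope(x, log_t)  # should be near -beta1 / alpha = -0.5
```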


22.3.3 Stock Sampling 


Flow data with right censoring are common, but other sampling schemes are also 
used. With stock sampling we randomly sample from individuals that are in the initial 
state at a given point in time. The population is again individuals who enter the ini- 
tial state during a specified interval, [0, b]. However, rather than observe a random 
sample of people flowing into the initial state, we can only obtain a random sample 
of individuals that are in the initial state at time b. In addition to the possibility of 
right censoring, we may also face the problem of left censoring, which occurs when 
some or all of the starting times, $a_i$, are not observed. For now, we assume that (1) we
observe the starting times $a_i$ for all individuals we sample at time $b$ and (2) we can
follow sampled individuals for a certain length of time after we observe them at time 
b. We also allow for right censoring. 

In the unemployment duration example, where the population comprises workers 
who became unemployed at some point during 1998, stock sampling would occur if 
we randomly sampled from workers who were unemployed during the last week of 
1998. This kind of sampling causes a clear sample selection problem: we necessarily 
exclude from our sample any individual whose unemployment spell ended before the 
last week of 1998. Because these spells were necessarily shorter than a year, we can- 
not just assume that the missing observations are randomly missing. 

The sample selection problem caused by stock sampling is essentially the same 
situation we faced in Section 19.5, where we covered the truncated regression model. 
Therefore, we will call this the left truncation problem. Kiefer (1988) calls it length- 
biased sampling. 

Under the assumptions that we observe the $a_i$ and can observe some spells past
the sampling date b, left truncation is fairly easy to deal with. With the exception of 
replacing flow sampling with stock sampling, we make the same assumptions as in 
Section 22.3.2. 

To account for the truncated sampling, we must modify the density in equation
(22.23) to reflect the fact that part of the population is systematically omitted from
the sample. Let $(a_i, c_i, x_i, t_i)$ denote a random draw from the population of all spells
starting in [0, b]. We observe this vector if and only if the person is still in the initial
state at time b, that is, if and only if $a_i + t_i^* \ge b$ or $t_i^* \ge b - a_i$, where $t_i^*$ is the
true duration. But, under the conditional independence assumption (22.22),

$$P(t_i^* \ge b - a_i \mid a_i, c_i, x_i) = 1 - F(b - a_i \mid x_i; \theta) \tag{22.29}$$


where $F(\cdot \mid x_i; \theta)$ is the cdf of $t_i^*$ given $x_i$, as before. The correct conditional
density function is obtained by dividing equation (22.23) by equation (22.29). In Problem
22.5 you are asked to adapt the arguments in Section 19.5 to also allow for right
censoring. The log-likelihood function can be written as

$$\sum_{i=1}^{N} \{ d_i \log[f(t_i \mid x_i; \theta)] + (1 - d_i)\log[1 - F(t_i \mid x_i; \theta)] - \log[1 - F(b - a_i \mid x_i; \theta)] \} \tag{22.30}$$

where, again, $t_i = c_i$ when $d_i = 0$. Unlike in the case of flow sampling, with stock
sampling both the starting dates, $a_i$, and the length of the sampling interval, b, appear
in the conditional likelihood function. Their presence makes it clear that specifying 
the interval [0, b] is important for analyzing stock data. [Lancaster (1990, p. 183) es- 
sentially derives equation (22.30) under a slightly different sampling scheme; see also 
Lancaster (1979).] 

Equation (22.30) has an interesting implication. If observation i is right censored at
calendar date b (that is, if we do not follow the spell after the initial data collection),
then the censoring time is $c_i = b - a_i$. Because $d_i = 0$ for censored observations, the log
likelihood for such an observation is
$\log[1 - F(c_i \mid x_i; \theta)] - \log[1 - F(b - a_i \mid x_i; \theta)] = 0$.
In other words, observations that are right censored at the data collection time
provide no information for estimating $\theta$, at least when we use equation (22.30).
Consequently, the log likelihood in equation (22.30) does not identify $\theta$ if all units are
right censored at the interview date: equation (22.30) is identically zero. The intuition
for why equation (22.30) fails in this case is fairly clear: our data consist only of
$(a_i, x_i)$, and equation (22.30) is a log likelihood that is conditional on $(a_i, x_i)$.
Effectively, there is no random response variable.
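
The stock-sampling log likelihood in equation (22.30), and the zero contribution of spells censored at the sampling date, can be sketched in Python as follows. This is only an illustration under an assumed Weibull duration distribution with hazard $\exp(x\beta)\alpha t^{\alpha-1}$; the function names, the scalar index `xb`, and all numerical values are hypothetical.

```python
import math

def weibull_log_density(t, xb, alpha):
    """log f(t|x) for the hazard exp(xb) * alpha * t^(alpha - 1)."""
    return (xb + math.log(alpha) + (alpha - 1) * math.log(t)
            - math.exp(xb) * t ** alpha)

def weibull_log_survivor(t, xb, alpha):
    """log[1 - F(t|x)] = -exp(xb) * t^alpha."""
    return -math.exp(xb) * t ** alpha

def stock_loglik_obs(t, d, a, b, xb, alpha):
    """Contribution of one observation to the log likelihood (22.30).

    t: observed duration (equal to the censoring time when d == 0)
    d: 1 if the spell is uncensored, 0 if it is right censored
    a: starting date of the spell; b: date at which the stock is sampled
    """
    main = (d * weibull_log_density(t, xb, alpha)
            + (1 - d) * weibull_log_survivor(t, xb, alpha))
    truncation = weibull_log_survivor(b - a, xb, alpha)   # log[1 - F(b - a | x)]
    return main - truncation

b, a = 52.0, 20.0   # hypothetical dates, measured in weeks
# Spell not followed past the sampling date: censoring time is c = b - a.
censored_at_interview = stock_loglik_obs(b - a, 0, a, b, xb=-3.0, alpha=0.8)
# Spell followed until it ends after 40 weeks.
followed = stock_loglik_obs(40.0, 1, a, b, xb=-3.0, alpha=0.8)
```

Here `censored_at_interview` is zero for any parameter values, which is the sense in which such observations carry no information about $\theta$ in equation (22.30).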

Even when we censor all observed durations at the interview date, we can still
estimate $\theta$, provided (at least in a parametric context) we specify a model for the
conditional distribution of the starting times, $D(a_i \mid x_i)$. (This is essentially the
problem analyzed by Nickell, 1979.) We are still assuming that we observe the $a_i$. So, for
example, we randomly sample from the pool of people unemployed in the last week
of 1998 and find out when their unemployment spells began (along with covariates).
We do not follow any spells past the interview date. (As an aside, if we sample
unemployed people during the last week of 1998, we are likely to obtain some
observations where spells began before 1998. For the population we have specified, these
people would simply be discarded. If we want to include people whose spells began
prior to 1998, we need to redefine the interval. For example, if durations are
measured in weeks and if we want to consider durations beginning in the five-year period
prior to the end of 1998, then b = 260.)

For concreteness, we assume that $D(a_i \mid x_i)$ is continuous on [0, b] with density
$k(\cdot \mid x_i; \eta)$. Let $s_i$ denote a sample selection indicator, which is unity if we observe
random draw i, that is, if $t_i^* \ge b - a_i$. Estimation of $\theta$ (and $\eta$) can proceed by
applying CMLE to the density of $a_i$ conditional on $x_i$ and $s_i = 1$. (Note that this is the only
density we can hope to estimate, as our sample only consists of observations $(a_i, x_i)$
when $s_i = 1$.) This density is informative for $\theta$ even if $\eta$ is not functionally related to $\theta$
(as would typically be assumed) because there are some durations that started and
ended in [0, b]; we simply do not observe them. Knowing something about the
starting time distribution gives us information about the duration distribution. (In the
context of flow sampling, when $\eta$ is not functionally related to $\theta$, the density of $a_i$
given $x_i$ is uninformative for estimating $\theta$; in other words, $a_i$ is ancillary for $\theta$.)

In Problem 22.6 you are asked to show that the density of $a_i$ conditional on
observing $(a_i, x_i)$ is

$$p(a \mid x_i, s_i = 1) = k(a \mid x_i; \eta)[1 - F(b - a \mid x_i; \theta)] / P(s_i = 1 \mid x_i; \theta, \eta) \tag{22.31}$$

$0 < a < b$, where

$$P(s_i = 1 \mid x_i; \theta, \eta) = \int_0^b [1 - F(b - a \mid x_i; \theta)]\, k(a \mid x_i; \eta)\, da \tag{22.32}$$

(Lancaster (1990, Section 8.3.3) essentially obtains the right-hand side of equation
(22.31) but uses the notion of backward recurrence time. The argument in Problem
22.6 is more straightforward because it is based on a standard truncation argument.)
Once we have specified the duration cdf, F, and the starting time density, k, we can
use conditional MLE to estimate $\theta$ and $\eta$: the log likelihood for observation i is just
the log of equation (22.31), evaluated at $a_i$. If we assume that $a_i$ is independent of
$x_i$ and has a uniform distribution on [0, b], the estimation simplifies somewhat; see
Problem 22.6. Allowing for a discontinuous starting time density $k(\cdot \mid x_i; \eta)$ does not
materially affect equation (22.31). For example, if the interval [0,1] represents one
year, we might want to allow different entry rates over the different seasons. This
approach would correspond to a uniform distribution over each subinterval that we
choose.
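
The density in equation (22.31) can be evaluated numerically once F and k are chosen. The Python sketch below assumes, purely for illustration, a constant hazard (so that $1 - F(u) = \exp(-\text{rate} \cdot u)$), uniform starting times on [0, b], and no covariates; the function names and the values of `b` and `rate` are hypothetical. With these choices the integral in equation (22.32) has a closed form, which the sketch uses as a check.

```python
import math

def integrate(f, lo, hi, n=2_000):
    """Midpoint rule for a smooth integrand."""
    h = (hi - lo) / n
    return h * sum(f(lo + (k + 0.5) * h) for k in range(n))

def selection_prob(b, survivor, start_density):
    """P(s_i = 1 | x_i) from equation (22.32)."""
    return integrate(lambda a: survivor(b - a) * start_density(a), 0.0, b)

def selected_start_density(a, b, survivor, start_density):
    """p(a | x_i, s_i = 1) from equation (22.31), for 0 < a < b."""
    return (start_density(a) * survivor(b - a)
            / selection_prob(b, survivor, start_density))

b, rate = 52.0, 0.05   # hypothetical sampling date (weeks) and hazard rate

def survivor(u):
    """1 - F(u) for a constant hazard (an assumption of this sketch)."""
    return math.exp(-rate * u)

def uniform_start(a):
    """k(a): starting times uniform on [0, b]."""
    return 1.0 / b

p_at_40 = selected_start_density(40.0, b, survivor, uniform_start)
# Closed form for this special case: rate*exp(-rate*(b - a)) / (1 - exp(-rate*b)).
closed_form = rate * math.exp(-rate * (b - 40.0)) / (1.0 - math.exp(-rate * b))
total_mass = integrate(
    lambda a: selected_start_density(a, b, survivor, uniform_start), 0.0, b, n=200)
```

The log likelihood for an observation with starting date $a_i$ would then be `math.log(selected_start_density(a_i, b, survivor, uniform_start))`, with the duration parameters entering through `survivor`.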

We now turn to the problem of left censoring, which arises with stock sampling
when we do not actually know when any spell began. In other words, the $a_i$ are not
observed, and therefore neither are the true durations, $t_i^*$. However, we assume that
we can follow spells after the interview date. Without right censoring, this
assumption means we can observe the time in the current spell since the interview date, say,
$r_i$, which we can write as $r_i = t_i^* + a_i - b$. We still have a left truncation problem
because we only observe $r_i$ when $t_i^* > b - a_i$, that is, when $r_i > 0$. The general
approach is the same as with the earlier problems: we obtain the density of the
variable that we can at least partially observe, $r_i$ in this case, conditional on observing
$r_i$. Problem 22.8 asks you to fill in the details, accounting also for possible right
censoring.

We can easily combine stock sampling and flow sampling. For example, in the
case that we observe the starting times, $a_i$, suppose that, at time m < b, we sample a
stock of individuals already in the initial state. In addition to following spells of
individuals already in the initial state, suppose we can randomly sample individuals
flowing into the initial state between times m and b. Then we follow all the
individuals appearing in the sample, at least until right censoring. For starting dates after
m ($a_i \ge m$), there is no truncation, and so the log likelihood for these observations is
just as in equation (22.24). For $a_i < m$, the log likelihood is identical to equation
(22.30) except that m replaces b. Other combinations are easy to infer from the
preceding results.


22.3.4 Unobserved Heterogeneity 


One way to obtain more general duration models is to introduce unobserved hetero- 
geneity into fairly simple duration models. In addition, we sometimes want to test for 
duration dependence conditional on observed covariates and unobserved heteroge- 
neity. The key assumptions used in most models that incorporate unobserved heter- 
ogeneity are that (1) the heterogeneity is independent of the observed covariates, as 
well as starting times and censoring times; (2) the heterogeneity has a distribution 
known up to a finite number of parameters; and (3) the heterogeneity enters the 
hazard function multiplicatively. We will make these assumptions. In the context of 
single-spell flow data, it is difficult to relax any of these assumptions. (In the special 
case of a lognormal duration distribution, we can relax assumption 1 by using Tobit
methods with endogenous explanatory variables; see Section 17.5.2.) 

In some fields, particularly those concerned with modeling survival times (such as 
biostatistics), the unobserved heterogeneity is called frailty. Then the hazard condi- 
tional on the unobserved frailty is the instantaneous probability of dying conditional 
on surviving up through time t for an individual with a given frailty. In this chapter,
we usually use the term heterogeneity. Often, a model that explicitly introduces
heterogeneity into a hazard function is called a mixture model.

Before we cover the general case, it is useful to cover an example due to Lancaster
(1979). For a random draw i from the population, a Weibull hazard function
conditional on observed covariates $x_i$ and unobserved heterogeneity $v_i$ is

$$\lambda(t; x_i, v_i) = v_i \exp(x_i\beta)\alpha t^{\alpha-1} \tag{22.33}$$

where $x_{i1} \equiv 1$ and $v_i > 0$. Lancaster (1990) calls equation (22.33) a conditional
hazard, because it conditions on the unobserved heterogeneity (or frailty) $v_i$. Technically,
almost all hazards in econometrics are conditional because we almost always
condition on observed covariates. Notice how $v_i$ enters equation (22.33) multiplicatively.
To identify the parameters $\alpha$ and $\beta$ we need a normalization on the distribution of $v_i$;
we use the most common, $E(v_i) = 1$. This implies that, for a given vector x, the
average hazard is $\exp(x\beta)\alpha t^{\alpha-1}$. An interesting hypothesis is $H_0: \alpha = 1$, which means
that, conditional on $x_i$ and $v_i$, there is no duration dependence.

In the general case where the cdf of $t_i^*$ given $(x_i, v_i)$ is $F(t \mid x_i, v_i; \theta)$, we can obtain
the distribution of $t_i^*$ given $x_i$ by integrating out the unobserved effect. Because $v_i$ and
$x_i$ are independent, the cdf of $t_i^*$ given $x_i$ is

$$G(t \mid x_i; \theta, \rho) = \int_0^{\infty} F(t \mid x_i, v; \theta)\, h(v; \rho)\, dv \tag{22.34}$$

where, for concreteness, the density of $v_i$, $h(\cdot; \rho)$, is assumed to be continuous and
depends on the unknown parameters $\rho$. From equation (22.34) the density of $t_i^*$ given
$x_i$, $g(t \mid x_i; \theta, \rho)$, is easily obtained. We can now use the methods of Sections 22.3.2
and 22.3.3. For flow data, the log-likelihood function is as in equation (22.24), but
with $G(t \mid x_i; \theta, \rho)$ replacing $F(t \mid x_i; \theta)$ and $g(t \mid x_i; \theta, \rho)$ replacing $f(t \mid x_i; \theta)$. We
should assume that $D(t_i^* \mid x_i, v_i, a_i, c_i) = D(t_i^* \mid x_i, v_i)$ and $D(v_i \mid x_i, a_i, c_i) = D(v_i)$;
these assumptions ensure that the key condition (22.22) holds. The methods for stock
sampling described in Section 22.3.3 also apply to the integrated cdf and density.

If we assume gamma-distributed heterogeneity (that is, $v_i \sim \text{Gamma}(\delta, \delta)$, so that
$E(v_i) = 1$ and $\text{Var}(v_i) = 1/\delta$), we can find the distribution of $t_i^*$ given $x_i$ for a broad
class of hazard functions with multiplicative heterogeneity. Suppose that the hazard
function is $\lambda(t; x_i, v_i) = v_i\kappa(t; x_i)$, where $\kappa(t; x) > 0$ (and need not have the
proportional hazard form). For simplicity, we suppress the dependence of $\kappa(\cdot\,; \cdot)$ on
unknown parameters. From equation (22.7), the cdf of $t_i^*$ given $(x_i, v_i)$ is

$$F(t \mid x_i, v_i) = 1 - \exp\left[-v_i \int_0^t \kappa(s; x_i)\, ds\right] = 1 - \exp[-v_i \xi(t; x_i)] \tag{22.35}$$

where $\xi(t; x_i) \equiv \int_0^t \kappa(s; x_i)\, ds$. We can obtain the cdf of $t_i^*$ given $x_i$ by using equation
(22.34). The density of $v_i$ is $h(v) = \delta^{\delta} v^{\delta-1} \exp(-\delta v)/\Gamma(\delta)$, where $\text{Var}(v_i) = 1/\delta$ and
$\Gamma(\cdot)$ is the gamma function. Let $\xi_i \equiv \xi(t; x_i)$ for given t. Then

$$\int_0^{\infty} \exp(-\xi_i v)\, \delta^{\delta} v^{\delta-1} \exp(-\delta v)/\Gamma(\delta)\, dv$$

$$= [\delta/(\delta + \xi_i)]^{\delta} \int_0^{\infty} (\delta + \xi_i)^{\delta} v^{\delta-1} \exp[-(\delta + \xi_i)v]/\Gamma(\delta)\, dv$$

$$= [\delta/(\delta + \xi_i)]^{\delta} = (1 + \xi_i/\delta)^{-\delta}$$


where the second-to-last equality follows because the integrand is the
Gamma($\delta$, $\delta + \xi_i$) density and must integrate to unity. Now we use equation (22.34):

$$G(t \mid x_i) = 1 - [1 + \xi(t; x_i)/\delta]^{-\delta} \tag{22.36}$$

Taking the derivative of equation (22.36) with respect to t, using the fact that $\kappa(t; x_i)$
is the derivative of $\xi(t; x_i)$, yields the density of $t_i^*$ given $x_i$ as

$$g(t \mid x_i) = \kappa(t; x_i)[1 + \xi(t; x_i)/\delta]^{-(\delta+1)} \tag{22.37}$$

The function $\kappa(t; x)$ depends on parameters $\theta$, and so $g(t \mid x)$ should be $g(t \mid x; \theta, \delta)$.
With censored data the vector $\theta$ can be estimated along with $\delta$ by using the
log-likelihood function in equation (22.24) (again, with G replacing F).
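
The integration step behind equations (22.36) and (22.37) can be checked numerically. The following Python sketch integrates $\exp(-v\xi)$ against the Gamma($\delta$, $\delta$) density by a simple midpoint rule and compares the result with the closed form $(1 + \xi/\delta)^{-\delta}$; it then verifies by a finite difference that (22.37) is the derivative of (22.36) for a Weibull $\kappa$. The function names and the numerical values are illustrative assumptions only.

```python
import math

def gamma_density(v, delta):
    """Density of Gamma(delta, delta): mean 1 and variance 1/delta."""
    return math.exp(delta * math.log(delta) + (delta - 1) * math.log(v)
                    - delta * v - math.lgamma(delta))

def mixed_survivor_numeric(xi, delta, upper=40.0, n=40_000):
    """Midpoint-rule integral of exp(-v*xi) against the Gamma(delta, delta) density.

    The upper limit truncates the integral; the neglected tail is negligible here.
    """
    h = upper / n
    return h * sum(math.exp(-(k + 0.5) * h * xi) * gamma_density((k + 0.5) * h, delta)
                   for k in range(n))

def mixed_cdf(xi, delta):
    """G(t|x) from equation (22.36), written in terms of xi = xi(t; x)."""
    return 1.0 - (1.0 + xi / delta) ** (-delta)

def mixed_density(kappa, xi, delta):
    """g(t|x) from equation (22.37)."""
    return kappa * (1.0 + xi / delta) ** (-(delta + 1.0))

# Illustrative Weibull case: xi(t) = c * t^alpha, kappa(t) = c * alpha * t^(alpha - 1).
c, alpha, delta, t = 0.3, 1.4, 2.0, 1.8
xi = c * t ** alpha
kappa = c * alpha * t ** (alpha - 1)

numeric_survivor = mixed_survivor_numeric(xi, delta)
eps = 1e-5
finite_diff = (mixed_cdf(c * (t + eps) ** alpha, delta)
               - mixed_cdf(c * (t - eps) ** alpha, delta)) / (2 * eps)
```

Here `numeric_survivor` should agree with `1 - mixed_cdf(xi, delta)` and `finite_diff` with `mixed_density(kappa, xi, delta)`, up to numerical error.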

The hazard associated with the density g(t| x) is typically called the unconditional 
hazard because it does not condition on the unobserved heterogeneity (but, as always, 
does condition on the observed covariates). It is useful to think of the unconditional 
hazard as something that can be estimated quite generally because it involves the 
distribution of an observed outcome conditional on observed covariates. In fact, 
without censoring, estimation is easily done without making parametric restrictions 
and without even introducing the notion of unobserved heterogeneity. 

When the hazard function has the Weibull form in equation (22.33), $\xi(t; x) =
\exp(x\beta)t^{\alpha}$, which leads to a very tractable analysis when plugged into equations
(22.36) and (22.37). The resulting duration distribution is called the Burr distribution.
Its hazard function (that is, the unconditional hazard function when the
conditional hazard is Weibull and the heterogeneity has a gamma distribution) is
$\exp(x\beta)\alpha t^{\alpha-1}[1 + \exp(x\beta)t^{\alpha}/\delta]^{-1}$. It is useful to reparameterize the Burr distribution
by letting $\eta = 1/\delta$ and then writing the hazard as

$$\exp(x\beta)\alpha t^{\alpha-1} / [1 + \eta \exp(x\beta)t^{\alpha}]$$

Then $\eta = 0$ (that is, $\text{Var}(v_i) = 0$) leads to the Weibull hazard, as expected. Further,
$\eta = 1$ (so that $\text{Var}(v_i) = E(v_i)$) gives the log-logistic hazard in equation (22.27).
Therefore, even if we ignore that we derived the Burr distribution using gamma
heterogeneity, we see that it nests two important special cases.
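
The nesting of the Weibull and log-logistic hazards in the Burr hazard is easy to confirm numerically. In the Python sketch below, `xb` stands for the scalar index $x\beta$, the function names are our own, and the evaluation point and parameter values are arbitrary illustrations. The last function recomputes the Burr hazard as the ratio of the density (22.37) to the survivor function implied by (22.36), as a consistency check on the reparameterization $\eta = 1/\delta$.

```python
import math

def burr_hazard(t, xb, alpha, eta):
    """exp(xb)*alpha*t^(alpha-1) / [1 + eta*exp(xb)*t^alpha]."""
    g = math.exp(xb)
    return g * alpha * t ** (alpha - 1) / (1.0 + eta * g * t ** alpha)

def weibull_hazard(t, xb, alpha):
    """Weibull hazard: the eta = 0 case."""
    return math.exp(xb) * alpha * t ** (alpha - 1)

def loglogistic_hazard(t, xb, alpha):
    """Log-logistic hazard (22.27): the eta = 1 case."""
    g = math.exp(xb)
    return g * alpha * t ** (alpha - 1) / (1.0 + g * t ** alpha)

def burr_hazard_from_delta(t, xb, alpha, delta):
    """Hazard g/(1 - G) built from (22.36)-(22.37) with xi = exp(xb)*t^alpha."""
    xi = math.exp(xb) * t ** alpha
    kappa = math.exp(xb) * alpha * t ** (alpha - 1)
    density = kappa * (1.0 + xi / delta) ** (-(delta + 1.0))
    survivor = (1.0 + xi / delta) ** (-delta)
    return density / survivor

t, xb, alpha = 3.0, -1.2, 1.6   # illustrative values only
gap_weibull = burr_hazard(t, xb, alpha, 0.0) - weibull_hazard(t, xb, alpha)
gap_loglogistic = burr_hazard(t, xb, alpha, 1.0) - loglogistic_hazard(t, xb, alpha)
gap_delta = burr_hazard(t, xb, alpha, 0.25) - burr_hazard_from_delta(t, xb, alpha, 4.0)
```

All three gaps are zero up to floating-point rounding.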

If we use a log-logistic hazard for $\kappa(t; x)$ (which, recall, implies that $\kappa(t; x)$ is not
multiplicatively separable in t and x) and assume gamma heterogeneity, then we
plug $\kappa(t; x) = \exp(x\beta)\alpha t^{\alpha-1}[1 + \exp(x\beta)t^{\alpha}]^{-1}$ and $\xi(t; x) = \log[1 + \exp(x\beta)t^{\alpha}]$ into
equation (22.37) to obtain the density (as a function of the parameters $\beta$, $\alpha$, and $\delta$); we
plug these expressions into equation (22.36) to obtain the cdf. Again, maximum
likelihood estimation with right censoring is fairly straightforward.

Before presenting an example, we should recall why we might want to explicitly
introduce unobserved heterogeneity when the heterogeneity is assumed to be
independent of the observed covariates. The strongest case is seen when we are interested
in testing for duration dependence conditional on observed covariates and
unobserved heterogeneity, where the unobserved heterogeneity enters the hazard
multiplicatively. As carefully exposited by Lancaster (1990, Section 10.2), ignoring
multiplicative heterogeneity in the Weibull model results in asymptotically
underestimating $\alpha$. Therefore, we could very well conclude that there is negative duration
dependence conditional on x, whereas there is no duration dependence ($\alpha = 1$)
conditional on x and v or even positive duration dependence ($\alpha > 1$).

In a general sense, it is somewhat heroic to think we can distinguish between dura- 
tion dependence and unobserved heterogeneity when we observe only a single cycle 
for each agent. The problem is simple to describe: because we can only estimate the 
distribution of T given x, we cannot uncover the distribution of T given (x, v) unless 
we make extra assumptions, a point Lancaster (1990, Section 10.1) illustrates with an 
example. Therefore, we cannot tell whether the hazard describing T given (x, v) 
exhibits duration dependence. But, when the hazard has the proportional hazard 
form $\lambda(t; x, v) = v\kappa(x)\lambda_0(t)$, it is possible to identify the function $\kappa(\cdot)$ and the baseline
hazard $\lambda_0(\cdot)$ quite generally (along with the distribution of v). See Lancaster (1990,
Section 7.3) for a presentation of the results of Elbers and Ridder (1982). More 
recently, Horowitz (1999) demonstrated how to nonparametrically estimate the 
baseline hazard and the distribution of the unobserved heterogeneity under fairly 
weak assumptions. 

When interest centers on how the observed covariates affect the mean duration,
explicitly modeling unobserved heterogeneity is less compelling. Adding unobserved
heterogeneity to equation (22.26) does not change the mean effects; it merely changes
the error distribution. Without censoring, we would probably estimate $\beta$ in equation
(22.26) by OLS (rather than MLE) so that the estimators would be robust to
distributional misspecification. With censoring, to perform maximum likelihood, we
must know the distribution of $t_i^*$ given $x_i$, and this depends on the distribution of $v_i$
when we explicitly introduce unobserved heterogeneity. But again introducing
unobserved heterogeneity is indistinguishable from simply allowing a more flexible
duration distribution.

Table 22.2
Weibull Hazard with Gamma Heterogeneity, Criminal Recidivism

Explanatory Variable    Coefficient    (Standard Error)
workprg                    .007            (.204)
priors                     .243            (.042)
tserved                    .035            (.007)
felon                      .791            (.267)
alcohol                   1.174            (.281)
drugs                      .285            (.223)
black                      .112            (.204)
married                    .806            (.258)
educ                       .027            (.045)
age                        .005            (.001)
constant                 -5.394            (.720)
$\hat{\alpha}$                  1.708            (.162)
$\hat{\delta}$                  5.991           (1.071)

Observations              1,445
Log likelihood        -1,584.92

We now apply a model with a Weibull baseline hazard and gamma heterogeneity 
to the recidivism data. Individuals with a larger heterogeneity, $v_i$, have a higher
probability of being arrested after release from prison in every interval, conditional 
on “surviving” up to that point. 

The maximum likelihood estimates are given in Table 22.2. Notice how the
estimate of $\alpha$, 1.708, is well above unity. Therefore, the conditional hazard exhibits
positive duration dependence: for a given individual, the instantaneous probability of
being arrested after prison release monotonically increases with the time out of
prison. The situation is displayed in Figure 22.3. By contrast, in the Weibull model
without heterogeneity, we estimated $\alpha$ to be .806.

Figure 22.3
Conditional Weibull hazard

Allowing for heterogeneity effects an important change in the shape of the hazard.
The qualitative effects of the covariates are similar to those in Table 22.1, although
the magnitudes of the coefficients on some of the variables (for example, the number
of priors, the felon dummy variable, and the marriage dummy variable) increase
nontrivially. The estimate of $\delta$ is about 6.0, and it is very statistically significant. Thus
the Weibull model without heterogeneity is strongly rejected in favor of the Weibull
model with heterogeneity.

The unconditional hazard (that is, the hazard for the duration distribution that
integrates out the heterogeneity) is plotted in Figure 22.4. The covariates are set
at mean values. Its shape is more like that in the lognormal model, although the peak
of the hazard in Figure 22.4 is at roughly 12 months, rather than six months. The
monotonicity of the conditional hazard coupled with the hump shape of the
unconditional hazard is common in applications. The reasoning behind the hump shape
for the unconditional hazard goes something like this. (For simplicity, assume that
there are no observed covariates, or that we are conditioning on a set of values.)
Initially, all types of men are in the risk set, and so the shape of the unconditional
hazard (that is, the aggregate hazard across the entire population) mimics that of
the conditional hazard. But in the early periods, men with high proclivities to commit
crimes (that is, with higher $v_i$) will tend to be arrested. Those arrested drop out of
the risk set in subsequent time periods. Therefore, the men left in the risk set in later
time periods tend to have the lower predispositions to repeat crimes. As the length of
time until arrest increases, men with smaller values of $v_i$ are left in the risk set, and
this result translates into a declining unconditional hazard.

Figure 22.4
Unconditional Burr hazard

As with all applications that assume the proportional hazard form (in both
observed covariates and unobserved heterogeneity), we are left with a conundrum:
should we accept that the conditional hazard has the shape in Figure 22.3 (for any
heterogeneity and any value of x) or should the conditional hazard be more
complicated? As mentioned earlier, unless we take the multiplicative structure
$\lambda(t; x_i, v_i) = v_i\kappa(x_i)\lambda_0(t)$ as given, there is a fundamental lack of identification. If we
replace $\kappa(x)\lambda_0(t)$ with the general nonseparable function $\kappa(t; x)$, identification of
$\kappa(\cdot\,; x)$ for given values of x becomes very difficult. Further, more complicated ways
of introducing heterogeneity lead into very rough waters. For example, suppose
we specify $\lambda(t; x_i, b_i) = \exp(x_i b_i)\lambda_0(t)$, where $b_i$ is a K x 1 random vector. (Letting
$x_i \equiv 1$ encompasses the usual case of multiplicative heterogeneity.) Even if we
assume $b_i$ is independent of $x_i$, it is not clear that one can identify $\lambda_0(t)$ without
assuming a full distribution for $b_i$. Even if we took such an approach, we would never
be able to distinguish between relatively simple models with unobserved
heterogeneity and complicated hazards that cannot be written as a multiple of $\lambda_0(t)$.
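
The selection argument for the hump-shaped unconditional hazard can be made concrete for the Weibull model with gamma heterogeneity. The Python sketch below uses two facts that follow from the derivation of equation (22.36): among individuals still at risk at time t, the heterogeneity has a Gamma($\delta$, $\delta + \xi$) distribution with mean $1/(1 + \xi/\delta)$, and the Burr hazard equals the baseline hazard times this mean. Setting the derivative of the log Burr hazard to zero gives a peak at $t^{\alpha} = (\alpha - 1)\delta/\exp(x\beta)$ when $\alpha > 1$; this expression is our own calculation, not a formula stated in the text. The values of $\alpha$ and $\delta$ are the estimates in Table 22.2, whereas the index value `xb` is hypothetical, so the implied peak is not meant to reproduce Figure 22.4.

```python
import math

def baseline_weibull(t, xb, alpha):
    """kappa(t; x) = exp(xb)*alpha*t^(alpha-1): hazard for an individual with v = 1."""
    return math.exp(xb) * alpha * t ** (alpha - 1)

def mean_v_among_survivors(t, xb, alpha, delta):
    """E(v | T > t, x) = 1/[1 + xi(t; x)/delta] under Gamma(delta, delta) heterogeneity."""
    xi = math.exp(xb) * t ** alpha
    return 1.0 / (1.0 + xi / delta)

def unconditional_hazard(t, xb, alpha, delta):
    """Burr hazard: baseline hazard times the mean of v in the remaining risk set."""
    return baseline_weibull(t, xb, alpha) * mean_v_among_survivors(t, xb, alpha, delta)

def peak_time(xb, alpha, delta):
    """Maximizer of the Burr hazard when alpha > 1."""
    return ((alpha - 1.0) * delta / math.exp(xb)) ** (1.0 / alpha)

alpha_hat, delta_hat = 1.708, 5.991   # estimates reported in Table 22.2
xb = -5.0                             # hypothetical index value, not from the text
t_peak = peak_time(xb, alpha_hat, delta_hat)
```

For a given individual the hazard `v * baseline_weibull(t, xb, alpha_hat)` keeps rising in t, but `mean_v_among_survivors` falls as high-$v_i$ individuals leave the risk set, and past `t_peak` the second effect dominates.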


22.4 Analysis of Grouped Duration Data 


Continuously distributed durations are, strictly speaking, rare in social science appli- 
cations. Even if an underlying duration is properly viewed as being continuous, mea- 
surements are necessarily discrete. When the measurements are fairly precise, it is 
sensible to treat the durations as continuous random variables. But when the mea- 
surements are coarse—such as monthly, or perhaps even weekly—it can be impor- 
tant to account for the discreteness in the estimation. 

Grouped duration data arise when each duration is only known to fall into a certain 
time interval, such as a week, a month, or even a year. For example, unemployment 
durations are often measured to the nearest week. In Example 22.2 the time until next 
arrest is measured to the nearest month. Even with grouped data we can generally 
estimate the parameters of the duration distribution. 

The approach we take here to analyzing grouped data summarizes the information 
on staying in the initial state or exiting in each time interval in a sequence of binary 
outcomes. (Kiefer, 1988; Han and Hausman, 1990; Meyer, 1990; Lancaster, 1990; 
McCall, 1994; and Sueyoshi, 1995, all take this approach.) In effect, we have a panel 
data set where each cross section observation is a vector of binary responses, along 
with covariates. In addition to allowing us to treat grouped durations, the panel data 
approach has at least two additional advantages. First, in a proportional hazard 
specification, it leads to easy methods for estimating flexible hazard functions.
Second, because of the sequential nature of the data, time-varying covariates are easily
introduced.

We assume flow sampling so that we do not have to address the sample selection
problem that arises with stock sampling. We divide the time line into M + 1
intervals, $[0, a_1), [a_1, a_2), \ldots, [a_{M-1}, a_M), [a_M, \infty)$, where the $a_m$ are known constants. For
example, we might have $a_1 = 1$, $a_2 = 2$, $a_3 = 3$, and so on, but unequally spaced
intervals are allowed. The last interval, $[a_M, \infty)$, is chosen so that any duration
falling into it is censored at $a_M$: no observed durations are greater than $a_M$. For a
random draw from the population, let $c_m$ be a binary censoring indicator equal to unity
if the duration is censored in interval m, and zero otherwise. Notice that $c_m = 1$
implies $c_{m+1} = 1$: if the duration was censored in interval m, it is still censored in
interval m + 1. Because durations lasting into the last interval are censored, $c_{M+1} \equiv 1$.
Similarly, $y_m$ is a binary indicator equal to unity if the duration ends in the mth
interval and zero otherwise. Thus, $y_{m+1} = 1$ if $y_m = 1$. If the duration is censored in
interval m ($c_m = 1$), we set $y_m \equiv 1$ by convention.

As in Section 22.3, we allow individuals to enter the initial state at different calen- 
dar times. In order to keep the notation simple, we do not explicitly show the con- 
ditioning on these starting times, as the starting times play no role under flow 
sampling when we assume that, conditional on the covariates, the starting times are 
independent of the duration and any unobserved heterogeneity. If necessary, starting- 
time dummies can be included in the covariates. 

For each person i, we observe $(y_{i1}, c_{i1}), \ldots, (y_{iM}, c_{iM})$, which is a balanced panel
data set. To avoid confusion with our notation for a duration (T for the random
variable, t for a particular outcome on T), we use m to index the time intervals. The
string of binary indicators for any individual is not unrestricted: we must observe a
string of zeros followed by a string of ones. The important information is the interval
in which $y_{im}$ becomes unity for the first time, and whether that represents a true exit
from the initial state or censoring.
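
The construction of the binary panel from a grouped duration can be sketched in a few lines of Python. The function name and its arguments are hypothetical; the coding follows the conventions just described, including $y_m \equiv 1$ once a spell has been censored. A spell that lasts into the final interval $[a_M, \infty)$ is represented here by `m_i = M + 1`, which yields all zeros over the M observed intervals.

```python
def binary_panel(m_i, censored, M):
    """Return (y, c), each a list of length M, for one individual.

    m_i: interval (1, ..., M + 1) in which the spell ends or is censored
    censored: True if the spell is censored in interval m_i
    M: number of intervals over which exits can be observed
    """
    y = [1 if m >= m_i else 0 for m in range(1, M + 1)]
    c = [1 if (censored and m >= m_i) else 0 for m in range(1, M + 1)]
    return y, c

# Uncensored exit in interval 3 of 5: zeros, then ones from interval 3 onward.
exit_in_3 = binary_panel(3, False, 5)
# Censored in interval 2 of 4: both strings switch to one in interval 2.
censored_in_2 = binary_panel(2, True, 4)
# Spell lasting past a_M with M = 4: no exit or censoring in intervals 1-4.
lasted_past_aM = binary_panel(5, True, 4)
```

Each row of the resulting panel pairs the interval index m with $(y_{im}, c_{im})$ and the covariates.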


22.4.1 Time-Invariant Covariates 


With time-invariant covariates, each random draw from the population consists of
information on $\{(y_1, c_1), \ldots, (y_M, c_M), x\}$. We assume that a parametric hazard
function is specified as $\lambda(t; x, \theta)$, where $\theta$ is the vector of unknown parameters. Let T
denote the time until exit from the initial state. While we do not fully observe T,
either we know which interval it falls into, or we know whether it was censored in a
particular interval. This knowledge is enough to obtain the probability that $y_m$ takes
on the value unity given $(y_{m-1}, \ldots, y_1)$, $(c_m, \ldots, c_1)$, and x. In fact, by definition this
probability depends only on $y_{m-1}$, $c_m$, and x, and only two combinations yield
probabilities that are not identically zero or one. These probabilities are

$$P(y_m = 0 \mid y_{m-1} = 0, x, c_m = 0) \tag{22.38}$$

$$P(y_m = 1 \mid y_{m-1} = 0, x, c_m = 0), \quad m = 1, \ldots, M \tag{22.39}$$


(We define $y_0 \equiv 0$ so that these equations hold for all $m \ge 1$.) To compute these
probabilities in terms of the hazard for T, we assume that the duration is
conditionally independent of censoring:

$$T \text{ is independent of } c_1, \ldots, c_M, \text{ given } x \tag{22.40}$$


This assumption allows the censoring to depend on x but rules out censoring that 
depends on unobservables, after conditioning on x. Condition (22.40) holds for fixed 
censoring or completely randomized censoring. (It may not hold if censoring is due to 
nonrandom attrition.) Under assumption (22.40) we have, from equation (22.9), 


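$$P(y_m = 1 \mid y_{m-1} = 0, x, c_m = 0) = P(a_{m-1} \le T < a_m \mid T \ge a_{m-1}, x)$$

$$= 1 - \exp\left[-\int_{a_{m-1}}^{a_m} \lambda(s; x, \theta)\, ds\right] \equiv 1 - \alpha_m(x, \theta) \tag{22.41}$$

for m = 1, 2, ..., M, where

$$\alpha_m(x, \theta) \equiv \exp\left[-\int_{a_{m-1}}^{a_m} \lambda(s; x, \theta)\, ds\right] \tag{22.42}$$

Therefore,

$$P(y_m = 0 \mid y_{m-1} = 0, x, c_m = 0) = \alpha_m(x, \theta) \tag{22.43}$$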


We can use these probabilities to construct the likelihood function. If, for observation
i, uncensored exit occurs in interval $m_i$, the likelihood is

$$\left[\prod_{h=1}^{m_i - 1} \alpha_h(x_i, \theta)\right] [1 - \alpha_{m_i}(x_i, \theta)] \tag{22.44}$$

The first term represents the probability of remaining in the initial state for the first
$m_i - 1$ intervals, and the second term is the (conditional) probability that T falls into
interval $m_i$. (Because an uncensored duration must have $m_i \le M$, expression (22.44)
at most depends on $\alpha_1(x_i, \theta), \ldots, \alpha_M(x_i, \theta)$.) If the duration is censored in interval
$m_i$, we know only that exit did not occur in the first $m_i - 1$ intervals, and the
likelihood consists of only the first term in expression (22.44).

If $d_i$ is a censoring indicator equal to one if duration i is uncensored, the log
likelihood for observation i can be written as

$$\sum_{h=1}^{m_i - 1} \log[\alpha_h(x_i, \theta)] + d_i \log[1 - \alpha_{m_i}(x_i, \theta)] \tag{22.45}$$

The log likelihood for the entire sample is obtained by summing expression (22.45)
across all i = 1, ..., N. Under the assumptions made, this log likelihood represents
the density of $(y_1, \ldots, y_M)$ given $(c_1, \ldots, c_M)$ and x, and so the conditional
maximum likelihood theory covered in Chapter 13 applies directly. The various ways of
estimating asymptotic variances and computing test statistics are available.

To implement conditional MLE, we must specify a hazard function. One hazard
function that has become popular because of its flexibility is a piecewise-constant
proportional hazard: for m = 1, ..., M,

$$\lambda(t; x, \theta) = \kappa(x, \beta)\lambda_m, \quad a_{m-1} \le t < a_m \tag{22.46}$$

where $\kappa(x, \beta) > 0$ (and typically $\kappa(x, \beta) = \exp(x\beta)$). This specification allows the
hazard to be different (albeit constant) over each time interval. The parameters to be
estimated are $\beta$ and $\lambda$, where the latter is the vector of $\lambda_m$, m = 1, ..., M. (Because
durations in $[a_M, \infty)$ are censored at $a_M$, we cannot estimate the hazard over the
interval $[a_M, \infty)$.) As an example, if we have unemployment duration measured in
weeks, the hazard can be different in each week. If the durations are sparse, we might
assume a different hazard rate for every two or three weeks (this assumption places
restrictions on the $\lambda_m$). With the piecewise-constant hazard and $\kappa(x, \beta) = \exp(x\beta)$,
for m = 1, ..., M, we have

$$\alpha_m(x, \theta) = \exp[-\exp(x\beta)\lambda_m(a_m - a_{m-1})] \tag{22.47}$$

Remember, the $a_m$ are known constants (often $a_m = m$) and not parameters to
be estimated. Usually the $\lambda_m$ are unrestricted, in which case x does not contain an
intercept.
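
Combining equations (22.45) and (22.47), the log likelihood contribution of a single observation under the piecewise-constant proportional hazard can be sketched in Python as follows. The function names, the list layout of `lam` and `cuts`, and the numerical values are assumptions made for illustration; `xb` again stands for the scalar index $x\beta$.

```python
import math

def alpha_m(xb, lam_m, width):
    """Interval survival probability (22.47): exp[-exp(xb)*lam_m*(a_m - a_{m-1})]."""
    return math.exp(-math.exp(xb) * lam_m * width)

def grouped_loglik_obs(m_i, d_i, xb, lam, cuts):
    """Log likelihood (22.45) for one observation.

    m_i: interval (1-based) in which exit or censoring occurs
    d_i: 1 if the duration is uncensored, 0 if censored
    lam: [lam_1, ..., lam_M]; cuts: [a_0 = 0, a_1, ..., a_M]
    """
    ll = 0.0
    for h in range(1, m_i):   # intervals 1, ..., m_i - 1 were survived
        ll += math.log(alpha_m(xb, lam[h - 1], cuts[h] - cuts[h - 1]))
    if d_i == 1:              # exit observed in interval m_i
        ll += math.log(1.0 - alpha_m(xb, lam[m_i - 1], cuts[m_i] - cuts[m_i - 1]))
    return ll

cuts = [0.0, 1.0, 2.0, 3.0]   # a_m = m, so M = 3
lam = [0.2, 0.3, 0.1]         # hypothetical interval-specific hazards
ll_exit_2 = grouped_loglik_obs(2, 1, 0.0, lam, cuts)       # uncensored exit in interval 2
ll_censored_3 = grouped_loglik_obs(3, 0, 0.0, lam, cuts)   # censored in interval 3
```

Summing `grouped_loglik_obs` over observations gives an objective that a numerical optimizer could maximize over $\beta$ and the $\lambda_m$; imposing $\lambda_m > 0$ (for example by optimizing over $\log \lambda_m$) is a design choice left open here.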

The piecewise-constant hazard implies that the duration distribution is
discontinuous at the endpoints, whereas in our discussion in Section 22.2 we assumed that the
duration had a continuous distribution. A piecewise-continuous distribution causes
no real problems, and the log likelihood is exactly as specified previously.
Alternatively, as in Han and Hausman (1990) and Meyer (1990), we can assume that T
has a proportional hazard as in equation (22.16) with continuous baseline hazard,
$\lambda_0(\cdot)$. Then, we can estimate $\beta$ along with the parameters

$$\int_{a_{m-1}}^{a_m} \lambda_0(s)\, ds, \quad m = 1, 2, \ldots, M$$

In practice, the approaches are the same, and it is easiest to just assume a
piecewise-constant proportional hazard, as in equation (22.46).

Once the $\lambda_m$ have been estimated along with $\beta$, an estimated hazard function is
easily plotted: graph $\hat{\lambda}_m$ at the midpoint of the interval $[a_{m-1}, a_m)$, and connect the
points.

Without covariates, maximum likelihood estimation of the $\lambda_m$ leads to a well-known
estimator of the survivor function. Rather than derive the MLE of the survivor
function, it is easier to motivate the estimator from the representation of the
survivor function as a product of conditional probabilities. For $m = 1, \ldots, M$, the
survivor function at time $a_m$ can be written as

$$S(a_m) = P(T > a_m) = \prod_{r=1}^{m} P(T > a_r \mid T > a_{r-1}) \tag{22.48}$$

(Because $a_0 = 0$ and $P(T > 0) = 1$, the $r = 1$ term on the right-hand side of equation
(22.48) is simply $P(T > a_1)$.) Now, for each $r = 1, 2, \ldots, M$, let $N_r$ denote the number
of people in the risk set for interval $r$: $N_r$ is the number of people who have neither
left the initial state nor been censored at time $a_{r-1}$, which is the beginning of interval
$r$. Therefore, $N_1$ is the number of individuals in the initial random sample; $N_2$ is the
number of individuals who did not exit the initial state in the first interval, less the
number of individuals censored in the first interval; and so on. Let $E_r$ be the number
of people observed to leave in the $r$th interval, that is, in the interval $[a_{r-1}, a_r)$. A
consistent estimator of $P(T > a_r \mid T > a_{r-1})$ is $(N_r - E_r)/N_r$, $r = 1, 2, \ldots, M$. (We
must use the fact that the censoring is ignorable in the sense of assumption (22.40), so
that there is no sample selection bias in using only the uncensored observations.) It
follows from equation (22.48) that a consistent estimator of the survivor function at
time $a_m$ is

$$\hat{S}(a_m) = \prod_{r=1}^{m} [(N_r - E_r)/N_r], \qquad m = 1, 2, \ldots, M \tag{22.49}$$

This is the Kaplan-Meier estimator of the survivor function (at the points
$a_1, a_2, \ldots, a_M$). Lancaster (1990, Section 8.2) contains a proof that maximum likelihood
estimation of the $\lambda_m$ (without covariates) leads to the Kaplan-Meier estimator of the
survivor function. If there are no censored durations before time $a_m$, $\hat{S}(a_m)$ is simply
the fraction of people who have not left the initial state at time $a_m$, which is obviously
consistent for $P(T > a_m) = S(a_m)$.
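
The product formula for the Kaplan-Meier estimator is simple to compute from interval counts. The sketch below is a minimal illustration; the function name `kaplan_meier_grouped` and its inputs (per-interval counts of observed exits and of censored spells) are assumptions made for the example. The risk set is updated as described above: each interval's risk set is the previous one less the exits and the censored spells.

```python
def kaplan_meier_grouped(exits, censored, n_start):
    """Kaplan-Meier survivor estimates at a_1, ..., a_M from grouped counts.

    exits[r]    : number observed to leave in interval r+1 (E_r in the text)
    censored[r] : number censored in interval r+1
    n_start     : size of the initial random sample (N_1 in the text)
    """
    surv = []
    s = 1.0
    at_risk = n_start  # N_r, the risk set at the start of the interval
    for e, c in zip(exits, censored):
        if at_risk <= 0:
            raise ValueError("risk set is empty before the last interval")
        s *= (at_risk - e) / at_risk  # estimate of P(T > a_r | T > a_{r-1})
        surv.append(s)
        at_risk -= e + c  # drop exits and censored spells from the next risk set
    return surv
```

Without censoring, the estimate at each endpoint equals the fraction of the sample still in the initial state; with censoring in early intervals the later factors use the smaller risk sets.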

In the general model, we do not need to assume a proportional hazard specification 
within each interval. For example, we could assume a log-logistic hazard within each 
interval, with different parameters for each m. Because the hazard in such cases does 
not depend on the covariates multiplicatively, we must plug in values of x in order to 
plot the hazard. Sueyoshi (1995) studies such models in detail. 

If the intervals $[a_{m-1}, a_m)$ are coarser than the data (for example, unemployment
is measured in weeks, but we choose $[a_{m-1}, a_m)$ to be four weeks for all $m$), then we
can specify nonconstant hazards within each interval. The piecewise-constant hazard 
corresponds to an exponential distribution within each interval. But we could specify, 
say, a Weibull distribution within each interval. See Sueyoshi (1995) for details. 


22.4.2 Time-Varying Covariates 


Deriving the log likelihood is more complicated with time-varying covariates, especially
when we do not assume that the covariates are strictly exogenous. Nevertheless,
we will show that, if the covariates are constant within each time interval $[a_{m-1}, a_m)$,
the form of the log likelihood is the same as expression (22.45), provided $x_i$ is
replaced with $x_{im}$ in interval $m$.

For the population, let $x_1, x_2, \ldots, x_M$ denote the outcomes of the covariates in
each of the M time intervals, where we assume that the covariates are constant within 
an interval. This assumption is clearly an oversimplification, but we cannot get very 
far without it (and it reflects how data sets with time-varying covariates are usually 
constructed). When the covariates are internal and are not necessarily defined after 
exit from the initial state, the definition of the covariates in the time intervals is 
irrelevant; but it is useful to list covariates for all M time periods. 

We assume that the hazard at time $t$ conditional on the covariates up through time
$t$ depends only on the covariates at time $t$. If past values of the covariates matter, they
can simply be included in the covariates at time $t$. The conditional independence
assumption on the censoring indicators is now stated as 


$$D(T \mid T \ge a_{m-1}, x_m, c_m) = D(T \mid T \ge a_{m-1}, x_m), \qquad m = 1, \ldots, M \tag{22.50}$$


This assumption allows the censoring decision to depend on the covariates during the
time interval (as well as past covariates, provided they are either included in $x_m$ or do
not affect the distribution of $T$ given $x_m$). Under this assumption, the probability of
exit (without censoring) is

$$\begin{aligned}
P(y_m = 1 \mid y_{m-1} = 0, x_m, c_m = 0) &= P(a_{m-1} \le T < a_m \mid T \ge a_{m-1}, x_m) \\
&= 1 - \exp\left[-\int_{a_{m-1}}^{a_m} \lambda(s; x_m, \theta)\, ds\right] \equiv 1 - \alpha_m(x_m, \theta)
\end{aligned} \tag{22.51}$$


We can use equation (22.51), along with $P(y_m = 0 \mid y_{m-1} = 0, x_m, c_m = 0) = \alpha_m(x_m, \theta)$,
to build up a partial log likelihood for person $i$. As we discussed in Section 13.8, this
is only a partial likelihood because we are not necessarily modeling the joint
distribution of $(y_1, \ldots, y_M)$ given $\{(x_1, c_1), \ldots, (x_M, c_M)\}$.

For someone censored in interval $m$, the information on the duration is contained
in $y_{i1} = 0, \ldots, y_{i,m-1} = 0$. For someone who truly exits in interval $m$, there is
additional information in $y_{im} = 1$. Therefore, the partial log likelihood is given by
expression (22.45), but, to reflect the time-varying covariates, $\alpha_h(x_i, \theta)$ is replaced by
$\alpha_h(x_{ih}, \theta)$ and $\alpha_{m_i}(x_i, \theta)$ is replaced by $\alpha_{m_i}(x_{i,m_i}, \theta)$.
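
As a rough illustration of how one individual's contribution to this partial log likelihood can be evaluated, the sketch below assumes the piecewise-constant specification with $\kappa(x_m, \beta) = \exp(x_m\beta)$, so that the log of the interval survival probability is $-\exp(x_{ih}\beta)\lambda_h(a_h - a_{h-1})$. The function name `partial_loglik_i` and the data layout (one covariate row per interval) are hypothetical.

```python
import math


def partial_loglik_i(x_rows, m_i, d_i, beta, lam, widths):
    """Partial log likelihood contribution for one spell with grouped data.

    x_rows : x_rows[h] holds the covariates for interval h+1 (constant within it)
    m_i    : 1-based interval in which the spell ends or is censored
    d_i    : 1 if the exit is observed, 0 if the spell is censored in interval m_i
    beta   : coefficient list; lam : baseline hazards; widths : a_h - a_{h-1}
    """
    def log_alpha(h):
        # log of the probability of surviving interval h+1, given survival to its start
        xb = sum(xj * bj for xj, bj in zip(x_rows[h], beta))
        return -math.exp(xb) * lam[h] * widths[h]

    # intervals 1, ..., m_i - 1 were survived without exit
    ll = sum(log_alpha(h) for h in range(m_i - 1))
    if d_i == 1:
        # log(1 - alpha) computed as log(-expm1(log alpha)) for numerical accuracy
        ll += math.log(-math.expm1(log_alpha(m_i - 1)))
    return ll
```

Summing this quantity over individuals and maximizing over `beta` and `lam` (with any general-purpose optimizer) would give the partial MLE under these assumptions.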

Each term in the partial log likelihood represents the distribution of $y_m$ given
$(y_{m-1}, \ldots, y_1)$, $(x_m, \ldots, x_1)$, and $(c_m, \ldots, c_1)$. (Most of the probabilities in this
conditional distribution are either zero or one; only the probabilities that depend on $\theta$
are shown in expression (22.45).) Therefore, the density is dynamically complete, in
the terminology of Section 13.8.3. As shown there, the usual maximum likelihood 
variance matrix estimators and statistics are asymptotically valid, even though we 
need not have the full conditional distribution of y given (x,c). This result would 
change if, for some reason, we chose not to include past covariates when in fact they 
affect the current probability of exit even after conditioning on the current covariates. 
Then the robust forms of the statistics covered in Section 13.8 should be used. In 
most duration applications we want dynamic completeness. 

If the covariates are strictly exogenous and if the censoring is strictly exogenous, 
then the partial likelihood is the full conditional likelihood. The precise strict exoge- 
neity assumption is 


$$D(T \mid T \ge a_{m-1}, x, c) = D(T \mid T \ge a_{m-1}, x_m), \qquad m = 1, \ldots, M \tag{22.52}$$


where $x$ is the vector of covariates across all time periods and $c$ is the vector of
censoring indicators. There are two parts to this assumption. Ignoring the censoring,
assumption (22.52) means that neither future nor past covariates appear in the haz- 
ard, once current covariates are controlled for. The second implication of assumption 
(22.52) is that the censoring is also strictly exogenous. 

With time-varying covariates, the hazard specification 


$$\lambda(t; x_m, \theta) = \kappa(x_m, \beta)\lambda_m, \qquad a_{m-1} \le t < a_m \tag{22.53}$$

$m = 1, \ldots, M$, is still attractive. It implies that the covariates have a multiplicative
effect in each time interval, and it allows the baseline hazard—the part common to 
all members of the population—to be flexible. 

Meyer (1990) essentially uses the specification (22.53) to estimate the effect of
unemployment insurance on unemployment spells. McCall (1994) shows how to allow
for time-varying coefficients when $\kappa(x_m, \beta) = \exp(x_m\beta)$. In other words, $\beta$ is replaced
with $\beta_m$, $m = 1, \ldots, M$.


22.4.3 Unobserved Heterogeneity 


We can also add unobserved heterogeneity to hazards specified for grouped data, 
even if we have time-varying covariates. With time-varying covariates and unob- 
served heterogeneity, it is difficult to relax the strict exogeneity assumption. Also, 
with single-spell data, we cannot allow general correlation between the unobserved 
heterogeneity and the covariates. Therefore, we assume that the covariates are strictly 
exogenous conditional on unobserved heterogeneity and that the unobserved hetero- 
geneity is independent of the covariates. 

The precise assumptions are given by equation (22.52) but where unobserved het- 
erogeneity, v, appears in both conditioning sets. In addition, we assume that v is in- 
dependent of (x, c) (which is a further sense in which the censoring is exogenous). 

In the leading case of the piecewise-constant baseline hazard, equation (22.53) 
becomes 


$$\lambda(t; v, x_m, \theta) = v\,\kappa(x_m, \beta)\lambda_m, \qquad a_{m-1} \le t < a_m \tag{22.54}$$

where $v > 0$ is a continuously distributed heterogeneity term. Using the same reasoning
as in Sections 22.4.1 and 22.4.2, the density of $(y_{i1}, \ldots, y_{iM})$ given $(v_i, x_i, c_i)$ is

$$\left[\prod_{h=1}^{m_i - 1} \alpha_h(v_i, x_{ih}, \theta)\right] [1 - \alpha_{m_i}(v_i, x_{i,m_i}, \theta)]^{d_i} \tag{22.55}$$


where $d_i = 1$ if observation $i$ is uncensored. Because expression (22.55) depends on
the unobserved heterogeneity, $v_i$, we cannot use it directly to consistently estimate $\theta$.
However, because $v_i$ is independent of $(x_i, c_i)$, with density $g(v; \delta)$, we can integrate
expression (22.55) against $g(\cdot\,; \delta)$ to obtain the density of $(y_{i1}, \ldots, y_{iM})$ given $(x_i, c_i)$.
This density depends on the observed data, $(m_i, d_i, x_i)$, and the parameters $\theta$ and $\delta$.
From this density, we construct the conditional log likelihood for observation $i$, and
we can obtain the conditional MLE, just as in other nonlinear models with unobserved
heterogeneity (see Chapters 15-19). Meyer (1990) assumes that the distribution
of $v_i$ is gamma, with unit mean, and obtains the log-likelihood function in closed
form. McCall (1994) analyzes a heterogeneity distribution that contains the gamma
as a special case.
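
To illustrate why a gamma distribution yields a closed form, the sketch below integrates the heterogeneity out analytically. It assumes the piecewise-constant hazard with $\kappa(x_m, \beta) = \exp(x_m\beta)$ and a gamma distribution for $v$ with unit mean and variance `sigma2`, for which $E[\exp(-vs)] = (1 + \sigma^2 s)^{-1/\sigma^2}$. The parameterization and the function name `mixed_loglik_i` are assumptions for this example and need not match Meyer's (1990) exact formulation.

```python
import math


def mixed_loglik_i(x_rows, m_i, d_i, beta, lam, widths, sigma2):
    """Log likelihood for one spell after integrating out gamma heterogeneity.

    x_rows, m_i, d_i, beta, lam, widths : as in a grouped-data hazard model
        (m_i is the 1-based interval of exit or censoring; d_i = 1 if uncensored)
    sigma2 : variance of the unit-mean gamma heterogeneity (assumed > 0)
    """
    def cum_hazard(upto):
        # integrated hazard (with v = 1) over the first `upto` intervals
        total = 0.0
        for h in range(upto):
            xb = sum(xj * bj for xj, bj in zip(x_rows[h], beta))
            total += math.exp(xb) * lam[h] * widths[h]
        return total

    def mixed_survivor(s):
        # E[exp(-v * s)] when v is gamma with mean 1 and variance sigma2
        return (1.0 + sigma2 * s) ** (-1.0 / sigma2)

    surv_before = mixed_survivor(cum_hazard(m_i - 1))
    if d_i == 1:
        # exit in interval m_i: survive to a_{m_i - 1} but not to a_{m_i}
        return math.log(surv_before - mixed_survivor(cum_hazard(m_i)))
    return math.log(surv_before)
```

As `sigma2` shrinks toward zero the mixed survivor function approaches $\exp(-s)$, so the expression collapses to the likelihood without heterogeneity.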

It is possible to consistently estimate $\beta$ and $\lambda$ without specifying a parametric form
for the heterogeneity distribution; this approach results in a semiparametric maximum
likelihood estimator. Heckman and Singer (1984) first showed how to perform
this method with a Weibull baseline hazard, and Meyer (1990) proved consistency 
when the hazard has the form (22.54). The estimated heterogeneity distribution is 
discrete and, in practice, has relatively few mass points. The consistency argument 
works by allowing the number of mass points to increase with the sample size. 
Computation is a difficult issue, and the asymptotic distribution of the semiparametric 
maximum likelihood estimator has not been worked out. 


22.5 Further Issues 


The methods we have covered in this chapter have been applied in many contexts. 
Nevertheless, there are several important topics that we have neglected. 


22.5.1 Cox’s Partial Likelihood Method for the Proportional Hazard Model 


Cox (1972) suggested a partial likelihood method for estimating the parameters $\beta$ in a
proportional hazard model without specifying the baseline hazard. The strength of 
Cox’s approach is that the effects of the covariates can be estimated very generally, 
provided the hazard is of the form (22.16). However, Cox’s method is intended to be 
applied to flow data as opposed to grouped data. If we apply Cox’s methods to 
grouped data, we must confront the practically important issue of individuals 
with identical observed durations. In addition, with time-varying covariates, Cox’s 
method evidently requires the covariates to be strictly exogenous. Estimation of the 
hazard function itself is more complicated than the methods for grouped data that 
we covered in Section 22.4. See Amemiya (1985, Chapter 11) and Lancaster (1990, 
Chapter 9) for treatments of Cox’s partial likelihood estimator. 


22.5.2 Multiple-Spell Data 


All the methods we have covered assume a single spell for each sample unit. In other 
words, each individual begins in the initial state and then either is observed leaving 
the state or is censored. But at least some individuals might have multiple spells, 
especially if we follow them for long periods. For example, we may observe a person 
who is initially unemployed, becomes employed, and then after a time becomes 
unemployed again. If we assume constancy across time about the process driving
unemployment duration, we can use multiple spells to aid in identification, particu- 
larly in models with heterogeneity that can be correlated with time-varying covari- 
ates. Chamberlain (1985) and Honoré (1993b) contain identification results when 
multiple spells are observed. Chamberlain allowed for correlation between the het- 
erogeneity and the time-varying covariates. 

Multiple-spell data are also useful for estimating models with unobserved hetero- 
geneity when the regressors are not strictly exogenous. Ham and Lalonde (1996) give 
an example in which participation in a job training program can be related to past 
unemployment duration, even though eligibility is randomly assigned. See also 
Wooldridge (2000) for a general framework that allows feedback to future explana- 
tory variables in models with unobserved heterogeneity. 


22.5.3 Competing Risks Models 


Another important topic is allowing for more than two possible states. Competing 
risks models allow for the possibility that an individual may exit into different alter- 
natives. For example, a person working full-time may choose to retire completely or 
work part-time. Han and Hausman (1990) and Sueyoshi (1992) contain discussions 
of the assumptions needed to estimate competing risks models, with and without 
unobserved heterogeneity. See van den Berg (2001) for a detailed treatment. 


Problems 


22.1. Use the data in RECID.RAW for this problem. 


a. Using the covariates in Table 22.1, estimate equation (22.26) by censored Tobit.
Verify that the log-likelihood value is -1,597.06.


b. Plug in the mean values for priors, tserved, educ, and age, and the values 
workprg = 0, felon = 1, alcohol = 1, drugs = 1, black = 0, and married = 0, and 
plot the estimated hazard for the lognormal distribution. Describe what you find. 

c. Using only the uncensored observations, perform an OLS regression of log(durat) 
on the covariates in Table 22.1. Compare the estimates on tserved and alcohol with 
those from part a. What do you conclude? 

d. Now compute an OLS regression using all data—that is, treat the censored 
observations as if they are uncensored. Compare the estimates on tserved and alcohol 
from those in parts a and c. 

22.2. Use the data in RECID.RAW to answer these questions:


a. To the Weibull model, add the variables super (=1 if release from prison was
supervised) and rules (number of rules violations while in prison). Do the coefficient
estimates on these new variables have the expected signs? Are they statistically
significant?

b. Add super and rules to the lognormal model, and answer the same questions as in 
part a. 

c. Compare the estimated effects of the rules variable on the expected duration for 
the Weibull and lognormal models. Are they practically different? 


22.3. Consider the case of flow sampling, as in Section 22.3.2, but suppose that all
durations are censored: $d_i = 0$, $i = 1, \ldots, N$.


a. Write down the log-likelihood function when all durations are censored. 
b. Find the special case of the Weibull distribution in part a. 


c. Consider the Weibull case where $x_i$ only contains a constant, so that $F(t; \alpha, \beta) =
1 - \exp[-\exp(\beta)t^{\alpha}]$. Show that the Weibull log likelihood cannot be maximized for
real numbers $\beta$ and $\alpha$.


d. From part c, what do you conclude about estimating duration models from flow 
data when all durations are right censored? 


e. If the duration distribution is continuous, $c_i > b > 0$ for some constant $b$, and
$P(t_i^* < t) > 0$ for all $t > 0$, is it likely, in a large random sample, to find that all
durations have been censored?


22.4. Suppose that, in the context of flow sampling, we observe for each $i$ covariates
$x_i$, the censoring time $c_i$, and the binary indicator $d_i$ (=1 if the observation is
uncensored). We never observe $t_i^*$.


a. Show that the conditional likelihood function has the binary response form. What 
is the binary “response”? 


b. Use the Weibull model to demonstrate the following when we only observe
whether durations are censored: if the censoring times $c_i$ are constant, the parameters
$\beta$ and $\alpha$ are not identified. (Hint: Consider the same case as in Problem 22.3c, and
show that the log likelihood depends only on the constant $\exp(\beta)c^{\alpha}$, where $c$ is the
common censoring time.)


c. Use the lognormal model to argue that, provided the $c_i$ vary across $i$ in the
population, the parameters are generally identified. (Hint: In the binary response model,
what is the coefficient on $\log(c_i)$?)

22.5. In this problem you are to derive the log likelihood in equation (22.30). Assume
that $c_i > b - a_i$ for all $i$, so that we always observe part of each spell after the
sampling date, $b$. In what follows, we suppress the parameter vector, $\theta$.

a. For $b - a_i < t < c_i$, show that
$P(t_i^* \le t \mid x_i, a_i, c_i, s_i = 1) = [F(t \mid x_i) - F(b - a_i \mid x_i)]/[1 - F(b - a_i \mid x_i)]$.

b. Use part a to obtain the density of $t_i^*$ conditional on $(x_i, a_i, c_i, s_i = 1)$ for
$b - a_i < t < c_i$.

c. Show that $P(t_i = c_i \mid x_i, a_i, c_i, s_i = 1) = [1 - F(c_i \mid x_i)]/[1 - F(b - a_i \mid x_i)]$.

d. Explain why parts b and c lead to equation (22.30).


22.6. Consider the problem of stock sampling where we do not follow spells after
the sampling date, $b$, as described in Section 22.3.3. Let $F(\cdot \mid x_i)$ denote the cdf of $t_i^*$
given $x_i$, and let $k(\cdot \mid x_i)$ denote the continuous density of $a_i$ given $x_i$. We drop
dependence on the parameters for most of the derivations. Assume that $t_i^*$ and $a_i$ are
independent conditional on $x_i$.


a. Let $s_i$ denote a selection indicator, so that $s_i = 1(t_i^* \ge b - a_i)$. For any $0 < a < b$,
show that

$$P(a_i \le a, s_i = 1 \mid x_i) = \int_0^a k(u \mid x_i)[1 - F(b - u \mid x_i)]\, du$$

b. Derive equation (22.32). (Hint: $P(s_i = 1 \mid x_i) = E(s_i \mid x_i) = E[E(s_i \mid a_i, x_i) \mid x_i]$, and
$E(s_i \mid a_i, x_i) = P(t_i^* \ge b - a_i \mid x_i)$.)

c. For $0 < a < b$, what is the cdf of $a_i$ given $x_i$ and $s_i = 1$? Now derive equation
(22.31).


d. Take $b = 1$, and assume that the starting time distribution is uniform on $[0, 1]$
(independent of $x_i$). Find the density (22.31) in this case.


e. For the setup in part d, assume that the duration cdf has the Weibull form,
$1 - \exp[-\exp(x_i\beta)t^{\alpha}]$. What is the log likelihood for observation $i$?


22.7. Consider the original stock sampling problem that we covered in Section
22.3.3. There, we derived the log likelihood (22.30) by conditioning on the starting
times, $a_i$. This approach is convenient because we do not have to specify a distribution
for the starting times. But suppose we have an acceptable model for $k(\cdot \mid x_i; \eta)$,
the (continuous) density of $a_i$ given $x_i$. Further, we maintain assumption (22.22) and
assume $D(a_i \mid c_i, x_i) = D(a_i \mid x_i)$.

a. Show that the log-likelihood function conditional on $x_i$, which accounts for
truncation, is

$$\sum_{i=1}^{N} \{d_i \log[f(t_i \mid x_i; \theta)] + (1 - d_i) \log[1 - F(t_i \mid x_i; \theta)]
+ \log[k(a_i \mid x_i; \eta)] - \log[P(s_i = 1 \mid x_i; \theta, \eta)]\} \tag{22.56}$$

where $P(s_i = 1 \mid x_i; \theta, \eta)$ is given in equation (22.32).

b. Discuss the trade-offs in using equation (22.30) or the log likelihood in (22.56).


22.8. In the context of stock sampling, where we are interested in the population of
durations starting in $[0, b]$, suppose that we interview at date $b$, as usual, but we do
not observe any starting times. (This assumption raises the issue of how we know
individual $i$'s starting time is in the specified interval, $[0, b]$. We assume that the
interval is defined to make this condition true for all $i$.) Let $r_i^* = a_i + t_i^* - b$, which can
be interpreted as the calendar date at which the spell ends minus the interview date.
Even without right censoring, we observe $r_i^*$ only if $r_i^* > 0$, in which case $r_i^*$ is simply
the time in the spell since the interview date, $b$. Assume that $t_i^*$ and $a_i$ are independent
conditional on $x_i$.


a. Show that for $r > 0$, the density of $r_i^*$ given $x_i$ is

$$h(r \mid x_i; \theta, \eta) = \int_0^b k(u \mid x_i; \eta) f(r + b - u \mid x_i; \theta)\, du$$

where, as before, $k(a \mid x_i; \eta)$ is the density of $a_i$ given $x_i$ and $f(t \mid x_i; \theta)$ is the duration
density.

b. Let $q > 0$ be a fixed censoring time after the interview date, and define $r_i =
\min(r_i^*, q)$. Find $P(r_i = q \mid x_i)$ in terms of the cdf of $r_i^*$, say, $H(r \mid x_i; \theta, \eta)$.

c. Use parts a and b, along with equation (22.32), to show that the log likelihood
conditional on observing $(r_i, x_i)$ is

$$d_i \log[h(r_i \mid x_i; \theta, \eta)] + (1 - d_i) \log[1 - H(r_i \mid x_i; \theta, \eta)]
- \log\left\{\int_0^b [1 - F(b - u \mid x_i; \theta)] k(u \mid x_i; \eta)\, du\right\} \tag{22.57}$$

where $d_i = 1$ if observation $i$ has not been right censored.

d. Simplify the log likelihood from part c when $b = 1$ and $k(a \mid x_i; \eta)$ is the uniform
density on $[0, 1]$.


22.9. Consider the Weibull model with multiplicative heterogeneity, as in equation
(22.33), but where $v_i$ takes on only two values. Think of there being two types of
people, A and B. Let $0 < \eta < 1$ be the value for type A people with $\rho = P(v_i = \eta)$,
$0 < \rho < 1$.


a. Show that to ensure $E(v_i) = 1$, the value of $v_i$ for type B people must be
$(1 - \rho\eta)/(1 - \rho)$.

b. Find the cdf of $t_i^*$ conditional on $x_i$ only. Call this $G(t \mid x_i; \alpha, \beta, \eta, \rho)$.

c. Find the density of $t_i^*$ conditional on $x_i$, $g(t \mid x_i; \alpha, \beta, \eta, \rho)$. How would you
estimate the parameters if you have right-censored data?

22.10. Let $0 < a_1 < a_2 < \cdots < a_{M-1} < a_M$ be a positive, increasing set of constants,
and let $T$ be a nonnegative random variable with $P(T > 0) = 1$.

a. Show that, for any $m = 1, \ldots, M$, $P(T > a_m) = P(T > a_m \mid T > a_{m-1})P(T > a_{m-1})$.

b. Use part a to derive equation (22.48).


22.11. Use the data in RECID to answer the following questions. 


a. Using the same explanatory variables as in Table 22.2, estimate a model with
gamma-distributed heterogeneity and a log-logistic hazard for $\kappa(t; x)$. Does the
model fit better or worse than the Weibull/gamma mixture model reported in Table
22.2?


b. Graph the conditional hazard function (with v = 1) and at the mean values of the 
covariates. How does its shape compare with the Weibull/gamma model? 

c. Graph the unconditional hazard function and comment on its shape. How do the 
conditional and unconditional hazards differ? 

Dummy variable estimator, 308-310, 328
Dummy variable regression, 307—310 
Duration analysis, 983—1019 
competing risks model, 1019 
Cox’s partial likelihood method for proportional 
hazard model, 1018 
grouped duration data, 1010-1018 
time-invariant covariates, 1011-1015 
time-varying covariates, 1015-1017 
unobserved heterogeneity, 1017—1018 
hazard functions, 984-991 
conditional on time-invariant covariates, 
988—989 
conditional on time-varying covariates, 989—991 
without covariates, 984—988 
multiple-spell data, 1018—1019 
single-spell data with time-invariant covariates, 
991-1010 
flow sampling, 992—993 
maximum likelihood estimation with censored 
flow data, 993-1000 
stock sampling, 1000—1003 
unobserved heterogeneity, 1003—1010 
survival analysis, 983 
Duration dependence, 987 
Durbin-Watson-Hausman (DWH) test, 129-132 
Dynamic completeness 
conditional density, 492—494 
conditional on the unobserved effect, 371—374 
of conditional mean, 194-196 
of pooled probit and logit, 609-610 
Dynamic ignorability, 970—971 
Dynamic unobserved effects models 
binary response models, 625—630 
corner solution responses, 713—715 


Efficiency 
asymptotic, 44, 103-104, 131, 229-231 
generalized method of moments (GMM) estima- 
tion, 538-545 
conditional moment restrictions, 542—545 
general efficiency framework, 538—540 
maximum likelihood estimation, 540—542 
relative efficiency, 539-540 
linear simultaneous equations models (SEMs), 
260-261
relative efficiency of two-stage least squares 
(2SLS) analysis, 103-104 
Efficiency bound, 541 
Elasticity, 17 
conditional expectations, 16—18 
Endogenous switching regression, 948 
Endogenous variables, 54—55 
control function approach in single-equation 
linear models, 126—129 
explanatory, 585-594, 630-632, 651-653, 
660-662, 681-685, 753-755, 809-813, 817-819 
exponential regression function, 742—748 
fractional responses, 753—755 
identification in linear systems, 241—242 
identification in nonlinear systems, 262—263 
probit models with heterogeneity and endogenous 
explanatory variables, 630—632 
specification tests, 129-134 
types 
measurement error, 55, 76—82 
omitted, 54-55, 65-76 
simultaneity, 55 
Equivalent structures, 246 
Error terms. See also Omitted variables 
degrees-of-freedom correction, 62, 106-107 
measurement errors, 55, 76—82 
dependent variable in ordinary least squares 
(OLS) analysis, 76-78 
explanatory variable in ordinary least squares 
(OLS) analysis, 78—82 
multiplicative measurement error, 78 
nature of, 8, 18 
standard errors 
heteroskedasticity-robust standard error, 61 
Huber standard error, 61 
two-stage least squares (2SLS) analysis, 
108-112 
White standard error, 61 
Estimable model, ordinary least squares (OLS) 
estimation, 53 
Exchangeable working correlation matrix, 
448—449 
Exclusion restrictions, in simultaneous equations 
models (SEMs), 241—243 
Exogenous sampling, 795 
Exogenous variables, 54 
explanatory variables in sample selection, 
802-808, 815-817 
fractional responses, 748-753 
identification in linear systems, 241—242 
logit models under strict exogeneity, 619—625 
panel data models with unobserved effects, 
494-497 
relaxing strict exogeneity assumption, 764—766 
stratification based on, 861-863 
unobserved effects of probit models under strict
exogeneity, 610-619 
Expected Hessian form of the LM statistic, 425 
Experimental data, 5 
Experimental group, difference-in-differences (DD) 
estimation, 147-148 
Explained variable (y). See Dependent variable (y) 
Explanatory variable (x), 13, 14 
binary endogenous, 594—599 
continuous exogenous, 585—594 
endogenous, 585-594, 630-632, 651-653, 
660-662, 681-686, 753-755, 809-813, 817-819 
exogenous, 585-594, 748-753 
fixed, 9-11 
instrumental variables (IV) approach and, 89-90 
in ordinary least squares (OLS) analysis, 
measurement error, 78—82 
probit models with heterogeneity and endogenous 
explanatory variables, 630—632 
Exponential conditional mean, 694-696 
Exponential distribution, 986-987 
Exponential QMLE, 741 
Exponential regression function, 397, 742—748 
Exponential response function, 814 
Exponential type II Tobit (ET2T) model, 697-703, 
790-791, 804, 808 
External covariates, 990 


F statistic 
dummy variable regression, 309 
from OLS analysis, 62, 104-105 
for two-stage least squares (2SLS) analysis, 105, 
112 
Factor loads, 552 
Feasible GLS (FGLS) estimator, 176-188, 
200-201, 269-270 
random effects analysis, 296-299 
Finite distributed lag (FDL) model, 165-166 
First differencing (FD) methods, 315-321 
first-difference (FD) estimator, 316-318, 321-326 
first difference instrumental variables (FDIV) 
estimator, 361—365 
inference, 315-318 
linear unobserved effects models, random trend 
model, 375-377 
policy analysis, 320—321 
robust variance matrix, 318-319 
testing for serial correlation, 319—320 
transformation and, 316-318 
First-order asymptotic distribution, 407 
First-stage regression, two-stage least squares 
(2SLS) analysis, 97, 105 
Fixed effects (FE) estimators, 287, 300-304, 310, 
332, 366, 495 
first differencing estimators versus, 321—326 
Fixed effects (FE) estimators (cont.)
in cluster sampling, 867—870, 876, 880 
logit estimators, 621—623 
random effects estimators versus, 326-334, 
349-358 
Fixed effects (FE) framework, 286, 300-315 
asymptotic inference with fixed effects, 304—307 
dummy variable regression, 307—310 
estimators, 287, 300—304, 310 
fixed effects generalized least squares (FEGLS), 
300, 312-315 
Hausman test comparing random and fixed 
effects estimators, 328—334, 355-358 
Poisson estimation, 762—764 
random trend model, 375-377 
robust variance matrix estimator, 310-312 
robustness of standard methods, 382-384 
serial correlation, 310-311 
time-varying coefficients on unobserved effects, 
552-554 
Fixed effects GLS (FEGLS), 300, 312-315 
Fixed effects instrumental variables (FEIV) 
estimator, 353—358, 364-365 
Fixed effects Poisson (FEP) estimator, 763 
Fixed effects Poisson model, 756 
Fixed effects residuals, 306-307 
Fixed effects 2SLS (FE2SLS) estimator, 353-358 
Fixed effects transformation, 302—303 
Fixed explanatory variables, 9-11 
Flow sampling, 992-993 
Forbidden regression, 267—268 
Fractional logit regression, 751 
Fractional probit regression, 751 
Fractional responses, 748-755 
endogenous explanatory variables, 753—755 
exogenous explanatory variables, 748-753 
panel (longitudinal) data, 766-769 
Fully recursive system, in simultaneous equations 
models (SEMs), 259-260 
Fully robust variance matrix estimator, 416 
Fuzzy regression discontinuity (FRD) design, 
957-959 


Gamma (exponential) regression model, 740-742 
Gamma QMLE, 741 
Gamma regression model, 741 
Gamma-distributed heterogeneity, 1004 
Gauss-Markov assumptions, 308 
Gauss-Newton optimization method, 434—435 
General linear system of equations, 210—213 
Generalized condition information matrix equality 
(GCIME), 513-514 
Generalized estimating equation (GEE), 446-447 
for cluster sampling, 872 
for panel data, 514—517, 614 
Generalized extreme value distribution, 651
Generalized Gauss-Newton optimization method, 
434-435 
Generalized information matrix equality (GIME), 
417 
Generalized instrumental variables (GIV) 
estimator, 207, 222-226 
comparison with generalized method of moment, 
224-226 
comparison with three-stage least squares (3SLS) 
estimator, 224—226, 254-255 
derivation, 222—224 
Generalized inverse, 312 
Generalized least squares (GLS) analysis 
cluster sampling, 866, 878 
fixed effects (FEGLS), 300, 312-315 
random effects analysis, 292—294 
systems of equations, 161, 173—184, 200-201 
asymptotic normality, 175-176 
consistency, 173—175 
feasible GLS (FGLS), 176-188, 200-201 
ordinary least squares (OLS) analysis versus, 
185-188 
weighted multivariate nonlinear least squares 
(WMNLS) estimator, 444—449 
Generalized linear model (GLM), 512-513 
binomial variance assumption, 739-740 
Poisson, 725, 729-730 
standard error, 729—730 
Generalized method of moments (GMM) 
estimation, 207, 213-222, 226-229, 232-233, 
525-555, 748 
asymptotic properties, 525-530 
efficient estimation, 538—545 
general weighting matrix, 213-216 
in cluster sampling, 871—872 
in linear unobserved effects panel data model, 
345-349, 370-371, 372-374 
in nonlinear simultaneous equations (SEMs), 270, 
272 
minimum distance estimation 
classical, 545-547 
unobserved effects model, 549-551 
optimal weighting matrix, 217—219 
panel data application, 547-555 
minimum distance approach to unobserved 
effects model, 549-551 
nonlinear dynamic models, 547—549 
time-varying coefficients on unobserved effects, 
551-555 
systems of nonlinear equations, 532—538 
three-stage least squares (3SLS) analysis, 
219-222, 232 
two-stage least squares (2SLS) estimator, 216-217 
under orthogonality conditions, 530-532
Generalized method of moments (GMM) 
estimator 
comparison with generalized instrumental 
variables (GIV) estimator, 224-226 
statement of, 525-530 
testing classical hypotheses, 226—227 
testing overidentification restrictions, 228—229, 
551 
Generalized propensity score, 963 
Generalized residual function, 530 
Generated instruments, 125 
estimating, 125-126 
GMM with generated instruments, 543-545 
two-stage least squares (2SLS) analysis, 
124-125 
Generated regressors, 123 
estimating, 125-126 
ordinary least squares analysis in single-equation 
linear models, 123—124 
Geometric QMLE, 738 
GLM variance assumption, 512—513, 725 
GMM. See Generalized method of moments 
(GMM) estimation 
GMM criterion function statistic, 529-530 
GMM distance statistic, 227, 529-530 
GMM three-stage least squares (GMM 3SLS) 
estimator, 219-222, 232 
GMM with generated instruments, 543-545 
Goodness-of-fit measure, 573—574, 677 
Grouped duration data, 1010-1018 
time-invariant covariates, 1011-1015 
time-varying covariates, 1015-1017 
unobserved heterogeneity, 1017—1018 


Hausman test, 324-325, 328-334 
comparing random and fixed effects estimators, 
328-334, 355-356 
computing Hausman statistic, 331—332 
Hausman and Taylor (HT) model, 358-361 
key component, 330 
Heckit procedure, 699, 805-808 
Heckman’s method (Heckit), 699, 805-808, 
838-849 
Hedonic price system, 535-536 
Hessian form of the LM statistic, 424—427 
Hessian of the objective function, 406—409, 414, 
417, 420 
Heterogeneity 
neglected, 582—585, 680—681 
probit models with heterogeneity and endogenous 
explanatory variables, 630—632 
unobserved, 22, 285, 1003-1010, 1017-1018 
Heterokurtosis-robust test, 141 
Heteroskedasticity 
data censoring, 781 
for two-stage least squares (2SLS) analysis, 
106-107 
heteroskedastic probit model, 602—605, 686-687 
heteroskedasticity-robust t statistics, 62, 106-107, 
136-138 
in latent variable model, 599—604, 606, 685-687 
ordinary least squares (OLS) analysis, 60-62 
specification tests, 138—141 
standard error, 61—62 
testing after pooled ordinary least squares 
(POLS) analysis, 199-200 
Heteroskedasticity-robust variance matrix 
estimator for NLS, 416-417 
Hierarchical linear models (HLM), in cluster 
sampling, 876-877, 879, 881 
Hierarchical model, 649 
Histogram estimator, 457 
Homogeneous linear restrictions, 247 
Homokurtosis, 139—140 
Homoskedasticity 
assumption in sample selection, 796-797 
asymptotic efficiency, 231 
ordinary least squares (OLS) analysis, 59—60, 
131, 220-221 
system homoskedasticity assumption, 180—182 
Huber standard error, 61 
Huber-White sandwich estimator, 415—416, 
446-447 
Hurdle models 
lognormal, 694—696 
nature of, 691 
truncated normal, 692-694, 695 
Hypothesis testing 
in binary response index models, 569-573 
multiple exclusion restrictions, 570-571 
nonlinear hypothesis about β, 571 
tests against more general alternatives, 
571-573 
in maximum likelihood estimation (MLE), 
481—482 
nonlinear regression model, 420—431 
behavior of statistics under alternatives, 
430-431 
change in objective function, 428—430 
score (Lagrange multiplier) tests, 421-428 
Wald tests, 420-421 
Poisson quasi-maximum likelihood estimator 
(QMLE), 732-734 
two-stage least squares (2SLS) analysis, 104-106 


Identification assumption, 13—14 
Identification problem, 91—92 

Identified due to a nonlinearity, 265—266 
Idiosyncratic disturbances, 285 
Idiosyncratic errors, 285 
Ignorability of treatment, 908-936 
identification, 911-915 
matching methods, 934-936 
propensity score methods, 920-934 
regression adjustment, 915—920, 930—934 
Ignorable selection, 822 
Imperfect proxy variables, 69 
Incidental parameters problem, 495, 612 
Incidental truncation, 777—778, 802-821 
exogenous explanatory variables, 802-808 
Tobit selection equation, 815-821 
Independence from irrelevant alternatives (IIA) 
assumption, 648—649 
Independent, not identically distributed (i.n.i.d.) 
sample, 6, 146-147 
Independent identically distributed (i.i.d.) sample, 
5 
Independent variable. See Explanatory variable (x) 
Index models, 565—582 
binary response models 
logit, 566, 568, 573-582 
maximum likelihood estimation (MLE), 
567-569 
probit, 566, 568, 573-582, 587, 591, 594-599 
reporting results, 573—582 
testing, 569-573 
maximum likelihood estimation (MLE) for 
binary response, 567-569 
Index structure, 512 
Indicator function, 450 
Individual effects, 285 
Individual heterogeneity, 285 
Individual-specific slopes, 374-387 
general models, 377—381 
random trend model, 375-377 
Influence function representation, 406—407 
Influential observations, 451—453 
Initial condition, 497—498 
Initial conditions problems, 626 
Instrumental variables (IV) estimation, 142-143, 
207-233 
average treatment effect (ATE), 937-954 
continuous endogenous explanatory variables, 
591-594 
first difference instrumental variables (FDIV) 
estimator, 361—365 
fixed effects (FE) methods, 353—358 
instrumental variable characteristics, 90 
nonlinear instrumental variables estimator, 531 
random effects (RE) methods, 349-353 
single-equation linear models, 89-114 
examples, 93—96 
motivation for, 89—98 
omitted variables problem, 92-93, 112-114 
two-stage least squares (2SLS) analysis. See 
Two-stage least squares (2SLS) analysis 
systems of equations, 207—233 
examples, 207—210 
general linear system of equations, 210-213 
generalized instrumental variables (GIV) 
estimator, 207, 222-226 
generalized linear system of equations, 210-213 
generalized method of moments (GMM) 
estimation, 207, 213-222, 226-229, 232-233, 
525-555, 748 
introduction, 207 
optimal instruments, 229—232 
simultaneous equations model (SEM), 207 
three-stage least squares (3SLS) analysis, 
219-222, 232 
two-stage least squares (2SLS) analysis, 232-233 
Instrumental variables (IV) estimator 
consistency of, 142—143 
of β, 92 
Internal covariates, 991 
Interval coding, 783-785 
Interval regression, 783—785 
Inverse Mills ratio, 672-673, 838 
Inverse probability weighting (IPW), 778, 
821-827, 840-844 
Iteration. See Law of iterated expectations (LIE) 


Jensen’s inequality, 31—32 
Just identified equations, 251 


Kaplan-Meier estimator, 1014-1015 
Kernel density estimator, 457 
Kernel estimators, 915 
Knowledge of the world of work (KWW) test, 72, 
812-813 
Kullback-Leibler information criterion (KLIC), 
473-474, 503-505, 523-524 


Lagged dependent variables, 290, 371—374, 
497-499 
Lagrange multiplier (score) statistic, 62-65, 107, 
299, 421—428 
Large-sample theory. See Asymptotic analysis 
Latent class model, 651 
Latent variable, 285, 471-472 
model of, 565 
heteroskedasticity, 599-604, 606, 685-687 
nonnormality, 599-604 
Law of iterated expectations (LIE), 18-22, 30-32, 
34-36, 288, 414, 672 
general statement, 19, 21 
identification problem, 20 
proxy variables, 23-24, 67-72 
two-stage least squares (2SLS) analysis, 101 
Law of large numbers 
statement of, 61, 401 
weak law of large numbers (WLLN), 42, 
403-404, 526 
Least absolute deviations (LAD) estimator, 404, 
451-453, 871 
Least squares. See Multivariate nonlinear 
regression methods; Ordinary least squares 
(OLS) analysis; Three-stage least squares 
(3SLS) analysis; Two-stage least squares 
(2SLS) analysis 
Least squares linear predictor, 26-27 
Left censoring, 1000 
Left truncation, 1000 
Length-biased sampling, 1000 
Likelihood ratio (LR) statistic, 481-482, 677 
Limited dependent variable, 559 
Limited information maximum likelihood (LIML) 
estimator, 745-746 
Limited information procedure, 591—592 
Limited range responses, average treatment effect 
(ATE), 960-961 
Limiting behavior, of estimators, 42—45 
Linear exponential family (LEF), 509-514 
Linear panel data models 
attrition, 837-845 
linear unobserved effects. See Linear unobserved 
effects panel data models 
sample selection, 827-837 
fixed effects estimation with unbalanced panels, 
828-831 
random effects estimation with unbalanced 
panels, 831—832 
testing and correcting for selection bias, 
832-837 
systems of equations, 163—167, 191—201 
assumptions for pooled ordinary least squares, 
191-194 
contemporaneous exogeneity, 164, 165 
dynamic completeness, 194—196 
example, 163-166, 167—168, 169-171, 191-192 
feasible generalized least squares under strict 
exogeneity, 200-201 
finite distributed lag (FDL) model, 165-166 
pooled ordinary least squares, 191—194, 198-199 
robust asymptotic variance matrix, 197—198 
sequential exogeneity, 164-165 
strict exogeneity, 165-166 
testing for heteroskedasticity, 199-200 
testing for serial correlation, 198—199 
time series persistence, 196—197 
Linear probability model (LPM), for binary 
response, 562—565 
Linear projections, 25-27 
least squares linear predictor, 26—27 
minimum mean square linear predictor, 26—27 
partial linear model, 29 
properties, 34—36 
Linear simultaneous equations models (SEMs), 
241-261 
cross equation restrictions, 256—257 
efficiency, 260-261 
estimation after identification, 252-256 
exclusion restrictions, 241—243 
general linear restrictions, 245—248 
identification, 251, 256-261 
order condition, 245, 249, 250-251 
rank condition, 248—251 
reduced forms, 243-245, 255-256 
structural equations, 241-242, 245-251, 260-261 
Linear unobserved effects panel data models, 
281-387 
assumptions, 285—290 
random versus fixed effects, 285—287 
strict exogeneity, 287—288 
comparison of estimators, 321—334 
fixed effects versus first differencing, 321—326 
Hausman test, 328—334 
random effects and fixed effects estimators, 
285-287, 326-334 
correlated random slopes, 384-387 
estimating by pooled ordinary least squares, 291 
estimation under sequential exogeneity, 368—374 
examples, 289-290 
first differencing methods, 315-321 
inference, 315-318 
instrumental variables, 361—365 
policy analysis, 320-321 
robust variance matrix, 318-319 
testing for serial correlation, 319—320 
fixed effects methods, 300-315 
asymptotic inference with fixed effects, 304—307 
consistency of fixed effects estimator, 300-304 
dummy variable regression, 307—310 
fixed effects versus first differencing, 321—326 
instrumental variables, 353—358 
random effects and fixed effects estimators, 
285-287, 326-334 
robust variance matrix estimator, 310-315 
robustness of standard, 382-384 
serial correlation, 310-315 
generalized method of moments (GMM) 
approaches, 345-349, 370-374 
Hausman and Taylor (HT) models, 358-361 
models with individual-specific slopes, 374-387 
omitted variables problem, 281—285 
random effects models, 285-287, 291-300 
estimation and inference, 291—297 
general feasible generalized least squares 
analysis, 298-299 
instrumental variables methods, 348-353 
robust variance matrix estimator, 297—298 
testing for presence of unobserved effect, 
299-300 
random trend model, 375-377 
unobserved effects, 282 
unobserved effects models with measurement 
error, 365—368 
Link function, 512 
LM statistic 
behavior under alternatives, 430—431 
expected Hessian form of the, 425 
Hessian form of the, 424—427 
outer product of the score, 424 
Local alternatives, 45-46, 430-431 
Local average response (LAR), 634-635 
Local average treatment effect (LATE), 906 
Local linear regression, 956 
Local power analysis, 45-46 
Log-logistic hazard function, 987-988 
Log-odds transformation, 749 
Logistic distribution, 988 
Logistic function, 397 
Logit model, 566 
conditional logit (CL) model, 647—649 
fractional logit regression, 751 
logit estimator, 568, 621-624 
mixed logit model, 648 
multinomial logit (MNL) model, 643—648 
nested logit model, 649-651 
ordered, 656—658 
pooled, 609—610 
pooled multinomial logit, 654 
unobserved effects under strict exogeneity, 
619-625 
Lognormal hurdle (LH) model, 694-696 
Longitudinal data. See Panel (longitudinal) data 


M-estimators, 400—401, 403—405 
asymptotic normality, 407—409, 411-413 
in cluster sampling, 871 
in complex survey sampling, 897—899 
sample selection, 841, 843-845 
two-step, 409—413 
adjustments, 418—420 
unweighted M-estimator in stratified sampling, 
861-863 
weighted M-estimator in stratified sampling, 
856-861 
Mahalanobis distance, 934 
Matching on the covariates, 934 
Matrix of instruments, 530 
Maximum likelihood estimation (MLE), 
469-517 
asymptotic normality, 476-479 
asymptotic variance, 479—481 
binary response index model, 567—569 
conditional maximum likelihood estimation 
(CMLE), 470-471, 473-476 
continuous endogenous explanatory variable, 
591-594 
count variable, 472 
data censoring, 780-781 
efficiency in generalized method of moments 
(GMM) estimation, 540-542 
hypothesis testing, 481—482 
in cluster sampling, 892—893 
index models, 567—569 
panel data models with unobserved effects, 
494-499 
lagged dependent variables, 497—499 
strictly exogenous explanatory variables, 
494-497 
parametric model, 471 
partial (pooled) likelihood methods for panel 
data, 485—494 
asymptotic inference, 490—492 
inference with dynamically complete models, 
492-494 
setup, 486—490 
Poisson regression, 472—473, 475, 477, 478, 481, 
485 
probit model, 471—472, 475, 477, 478, 480-481, 
488—489, 492 
quasi-maximum likelihood estimation, 502—517 
general misspecification, 502—504 
generalized estimating equations for panel data, 
514-517 
in linear exponential family, 509-514 
model selection tests, 505-509 
specification testing, 482—485 
two-step estimators involving maximum 
likelihood, 499—502 
first-step estimator, 500-502 
second-step estimator as maximum likelihood 
estimator, 499 
Maximum score estimator, 605-606 
Mean independent, average treatment effect 
(ATE), 907 
Measurement error 
endogenous variable, 55 
explanatory variable, 78—82 
in ordinary least squares (OLS) analysis, 76-82 
attenuation bias, 81 
classical errors-in-variables (CEV) assumption, 
80-82, 367 
dependent variable error, 76—78 
examples, 78, 81 
independent variable error, 78—82 
multiplicative measurement error, 78 
proxy variable versus, 76, 92—93, 113 
unobserved effects models, 365—368 
Measures, 522 
Median regression, 404—405 
Method of moments, ordinary least squares (OLS) 
analysis, 57—58 
Mills ratio, 809-812 
Minimum chi-square estimator, 528, 545 
Minimum distance (MD) estimates, 545-547, 
889-894 
Minimum mean square linear predictor, 26—27 
Missing at random (MAR), 795 
Missing completely at random (MCAR), 794-795, 
827 
Mixed models 
in cluster sampling, 876-877 
mixed logit model, 648 
Mixture model, 651, 1003 
Model selection tests, 505—509 
Monte Carlo simulation, 436-438 
Multinomial logit (MNL) model, 643—648 
Multinomial probit model, 649 
Multinomial response models, 643—654 
endogenous explanatory variables, 651—653 
multinomial logit (MNL) model, 643—648 
ordered response models, 643, 655-663 
endogenous explanatory variables, 660—662 
ordered logit, 656—658 
ordered probit, 655-658 
panel data methods, 662—663 
specification issues, 658-659 
panel data methods, 653—654 
probabilistic choice models, 646—651 
Multinomial sampling, 855 
Multiple exclusion restrictions, 570-571 
Multiple indicator solution, 112-114 
Multiple-spell data, 1018—1019 
Multiple treatments, average treatment effect 
(ATE), 964-967 
Multiplicative measurement error, 78 
Multiplicative random effects model, 759-760 
Multistage sampling, 894-895 
Multivalued treatments, average treatment effect 
(ATE), 961-963 
Multivariate nonlinear least squares estimator, 
443 
Multivariate nonlinear regression methods, 
442-449 
multivariate nonlinear least squares, 442-444 
weighted multivariate nonlinear least squares, 
444-449 
Mundlak representation, 461, 553—554 


√N-consistent estimator, 44-45 
√N-equivalent estimator, 44-45 
N-R-squared test, 63 
Natural experiments, 94—95, 147 
Negative duration dependence, 987 
Neglected heterogeneity, 582-585 
Tobit models, 680—681 
Nested logit model, 649-651 
Newey-Tauchen-White (NTW) statistic, 484-485 
Newton-Raphson optimization method, 432—433 
NLS estimator, 400 
Nominal response, 643 
Nonlinear endogenous variables, 262—263 
Nonlinear estimation 
generalized method of moments (GMM) 
approach. See Generalized method of moments 
(GMM) estimation 
maximum likelihood methods. See Maximum 
likelihood estimation (MLE) 
nonlinear regression model. See Nonlinear 
regression model 
Nonlinear hypothesis testing, about β, 571 
Nonlinear least squares (NLS) assumption, 
399-400, 408-409, 669 
multivariate, 442—444 
variable addition test (VAT) approach, 427—428, 
573 
weighted nonlinear least squares analysis, 
409-413 
Nonlinear least squares residuals, 416 
Nonlinear panel data models, generalized method 
of moments (GMM) estimation, 547-555 
minimum distance approach to unobserved effects 
model, 549-551 
models with time-varying coefficients on 
unobserved effects, 551—555 
nonlinear dynamic models, 547—549 
Nonlinear regression model, 397—462 
asymptotic inference, 454-459 
asymptotic normality, 405—409, 411-413 
asymptotic variance, 413—420 
adjustments for two-step estimation, 418—420 
estimation without nuisance parameters, 
413-418 
bootstrapping, 438—442 
consistency, 410—411 
correctly specified model for the conditional 
mean, 397—398 
hypothesis testing, 420-431 
behavior of statistics under alternatives, 430—431 
change in objective function, 428—430 
score (Lagrange multiplier) tests, 421—428 
Wald tests, 420—421 
identification, 401—402 
M-estimator, 400—401, 403—405, 409-413 
adjustments for two-step estimation, 418—420 
two-step, 409—413 
median regression, 404—405 
Monte Carlo simulation, 436—438 
multivariate, 442—449 
nonlinear least squares, 442—444 
weighted multivariate nonlinear least squares, 
444-449 
nonlinear least squares (NLS) assumption, 
399—400, 408-413 
optimization methods, 431—436 
Berndt, Hall, Hall, and Hausman algorithm, 
433-434 
concentrating parameters out of objective 
function, 435—436 
generalized Gauss-Newton method, 434-435 
Newton-Raphson method, 432-433 
parametric model, 397 
quantile estimation, 449-462 
consistency, 453 
estimation problem, 449—453 
quantile regression panel data, 459—462 
quantiles, 449—453 
resampling methods, 438—442 
uniform convergence in probability, 402—403 
Nonlinear simultaneous equations models (SEMs), 
262-271 
control function estimation for triangular 
systems, 268-271 
different instruments for different equations, 
271-273 
estimation, 266-271 
forbidden regression, 267—268 
three-stage least squares (3SLS) analysis, 
266-267 
two-stage least squares (2SLS) analysis, 266-267 
identification, 262—266 
Nonlinear 2SLS (N2SLS) estimator, 534-535 
Nonlinear SUR estimator, 447 
Nonnormality 
data censoring, 781 
in latent variable model, 599-604 
Nonparametric bootstrap, 438 
Nonparametric density estimator, 456—457 
Nonparametric regression, 605 
Nonparametric residual bootstrap, 441 
Nonsymmetric test, 440 
Normal distribution, asymptotically normal, 
40-41, 43-44, 175-176, 405-409, 411-413, 
728-732 
Normalization restriction, 247 
Normalized differences, 917 


Objective function 
change in, 428—430 
concentrating parameters out of, 435—436 
Hessian of the, 406—409, 414, 417, 420 
OLS. See Ordinary least squares (OLS) analysis 
OLS equation by equation, 169 
Omitted variables 
linear unobserved effects panel data models, 
281-285 
ordinary least squares (OLS) 
bias, 66-67 
ignoring omitted variables, 65—67 
inconsistency, 66—67 
nature of, 54-55 
proxy variable, 67-72 
solutions to, 65-76 
two-stage least squares (2SLS), 112 
Optimization methods, nonlinear regression model, 
431-436 
Berndt, Hall, Hall, and Hausman algorithm, 
433-434 
concentrating parameters out of objective 
function, 435—436 
generalized Gauss-Newton method, 434—435 
Newton-Raphson method, 432—433 
Order condition, 99, 211 
in simultaneous equations models (SEMs), 245, 
249-251 
Ordered response models, 643, 655-663 
ordered logit model, 656-657 
ordered probit model, 655—657 
parallel regression function, 658—659 
specification issues, 658—659 
Ordinary least squares (OLS) analysis, 6, 10. See 
also Pooled ordinary least squares (POLS) 
analysis 
incidental truncation in sample selection, 806-808 
influential observations, 451—453 
linear models estimation by, 796-798 
linear unobserved effects panel data models, 
283-284 
OLS estimator, 127-128 
pooled OLS estimation, 865-866 
simultaneous equations, 253-255 
single-equation linear models, 53—82 
asymptotic properties, 54—65 
estimable model, 53 
examples, 63-65, 67, 69-72, 75 
generated regressors, 123—124 
omitted variable problem, 65—76 
population model, 53 
properties under measurement error, 55, 76-82 
structural model, 53, 66 
weighted least squares (WLS) in, 60—61 
zero-mean assumption, 53-54 
specification tests 
endogeneity, 129-134 
functional form, 137—138 
heteroskedasticity, 138—141 
overidentifying restrictions, 134-137 
system estimation methods, 269-270 
systems of equations, 161, 166—173, 198—200 
asymptotic properties of system OLS, 167—172 
feasible generalized least squares versus, 185-188 
pooled ordinary least squares (POLS) estimator, 
169-172, 191-194, 198-199 
preliminaries, 166-167 
testing multiple hypotheses, 172—173 
Tobit I model, 677—680 
two-stage least squares (2SLS) compared with, 
107-112 
Orthogonality condition 
generalized method of moments (GMM) estimation, 
530-532 
population, 56-57 
Outer product of the score LM statistic, 424 
Overdispersion, 512—513 
Overidentification exogenous variable, 98 
Overidentification test statistic, 228—229, 551 
Overidentified equations, 251 
Overidentifying restrictions, 98, 251 
specification tests, 134-137 
Overlap assumption, 910 


Panel (longitudinal) data, 6-7. See also Linear 
panel data models; Nonlinear panel data models 
average treatment effect (ATE), 968-975 
binary response model, 608—635 
dynamic unobserved effects models, 625-630 
pooled probit and logit, 609-610 
probit models with heterogeneity and 
endogenous explanatory variables, 630—632 
semiparametric approaches, 605, 632—635 
unobserved effects logit models under strict 
exogeneity, 619-625 
unobserved effects probit models under strict 
exogeneity, 610-619 
cluster sampling with unit-specific panel data, 
876-883 
corner solution responses, 705—715 
dynamic unobserved effects Tobit models, 
713-715 
pooled methods, 705-707 
unobserved effects models under strict 
exogeneity, 707-713 
count data, 755-769 
conditional expectations with unobserved effects, 
758-759 
fixed effects Poisson estimation, 762—764 
fractional response model, 766-769 
pooled QMLE, 756-758 
random effects methods, 759-762 
relaxing strict exogeneity assumption, 764—766 
fractional response models for panel data, 
766-769 
models with lagged dependent variables, 497—499 
models with unobserved effects, 494—499 
strictly exogenous explanatory variables, 
494-497 
multinomial response models, 653-654 
ordered response models, 662—663 
partial (pooled) likelihood methods, 485-494 
Parallel regression assumption, 658—659 
Parameter space, 397 
Parametric bootstrap, 440-441 
Parametric model, 14, 397, 471 
Partial effect at the average (PEA), 575-577 
Partial effects 
average partial effects (APE), 22-25, 73, 141-142 
binary response models, 577—582, 586-589, 615 
conditional expectations, 15—16, 22-25 
estimating, 3—4 
nature of, 3 
structural, 583—585 
Partial linear model, 29 
Partial log likelihood, 486 
Partial (pooled) likelihood methods for panel data, 
485-494 
asymptotic inference, 490—492 
inference with dynamically complete models, 
492-494 
setup, 486—490 
Partial (pooled) maximum likelihood estimator 
(PMLE), 487-489 
Partial QMLE, 514 
Participation decision, 690—691 
Partitioned projection formula, 36 
Pearson dispersion estimator, 513 
Pearson residuals, 513 
Percentile-t method, 440 
Piecewise-constant proportional hazard, 
1013-1014 
Point-wise convergence in probability, 402 
Poisson GLM variance assumption, 725 
Poisson quasi-maximum likelihood estimator 
(QMLE), 727-732, 741, 745 
Poisson random effects model, 760 
Poisson regression, 724-736 
assumptions, 724—727 
asymptotic normality, 728-732 
consistency, 727—728 
fixed effects model, 756, 762—764 
hypothesis testing, 732—734 
maximum likelihood estimation (MLE), 472-473, 
475, 477, 478, 481, 485 
quantities of interest, 724—727 
random effects model, 760 
specification testing, 734-736 
Poisson variance assumption, 725 
Policy analysis 
difference-in-differences estimation, 147-151 
first differencing, 320-321 
fixed effects estimation, 315 
Pooled bivariate probit, 631—632 
Pooled cross section, 5—6 
over time, 146-147 
Pooled IV probit, 631—632 
Pooled least absolute deviations, 461—462 
Pooled log likelihood, 486 
Pooled logit, 609-610 
Pooled multinomial logit, 654 
Pooled negative binomial analysis, 761 
Pooled nonlinear least squares (PNLS) estimator, 
443-444 
Pooled ordered probit, 662—663 
Pooled ordinary least squares (POLS) analysis, 
198-199. See also Ordinary least squares 
(OLS) analysis 
assumptions, 191—194 
in cluster sampling, 867—870, 877-878, 882—884 
OLS test statistic, 332—333 
pooled ordinary least squares (POLS) estimator, 
169-172, 191-194, 198-199 
testing for heteroskedasticity, 199-200 
testing for serial correlation, 198—199 
unobserved effects models, 291, 365-366 
Pooled Poisson QMLE, 756-758 
Pooled probit, 609-610, 662—663 
Pooled probit estimator, 488—489 
Pooled QMLE, 514 
Pooled quantile regression, 459—460 
Pooled weighted nonlinear least squares (PWNLS) 
estimator, 447 
Poorly identified model, 265-266, 402 
Population-averaged (PA) model, 614-615 
Population-averaged parameters, 612—613 
Population model, 5 
ordinary least squares (OLS) estimation, 53 
Population orthogonality condition, 56—57 
Positive duration dependence, 987 
Prais-Winsten estimator, 201 
Primary sampling unit (PSU), 894 
Probabilistic choice models, 646-651 
Probability 
boundedness in, 37—40 
convergence in asymptotic analysis, 38 
with probability approaching one (w.p.a.1.), 
39—40 
Probability limit (plim), 38 
Probit model, 471—472, 475, 477, 478, 480—481, 
488—489, 492, 566 
attrition in linear panel data models, 838 
bivariate, 595-599 
conditional, 649 
fractional probit regression, 751, 753 
heteroskedastic, 602—605, 686-687 
incidental truncation in sample selection, 802-814 
index models, 566, 568, 573-582, 587, 591, 
594-599 
instrumental variables probit (IV probit), 591 
lognormal hurdle model, 696 
multinomial, 649 
ordered, 655-658 
pooled, 609-610, 662-663 
probit estimator, 568 
random effects ordered probit, 663 
reporting results for, 573—582 
unobserved effects under strict exogeneity, 
610-619 
with heterogeneity and endogenous explanatory 
variables, 630—632 
Propensity scores, 911, 920—934 
matching of, 936 
Proportional hazard models, 988 
Proportional hazard with time-varying covariates, 
991 
Proxy variables 
examples, 69-72 
formal requirements, 67—69 
imperfect, 69 
law of iterated expectations (LIE), 23—24, 67-72 
measurement error versus, 76, 92—93, 113 
ordinary least squares (OLS) analysis, 67—72 
Pseudo-maximum likelihood estimator, 503 
Pseudo-R-squared measures, 574-575, 580 


QLR (quasi-likelihood ratio) statistic, 429—430, 
532 
QMLE. See Quasi-maximum likelihood estimation 
(QMLE) 
Quantile estimation, 449-462 
Quantile regression, 404—405 
for panel data, 459—462 
quantile regression estimator, 451 
Quasi-likelihood ratio (QLR) statistic, 429-430, 
532 
Quasi-log-likelihood (pseudo-log-likelihood) 
function, 503 
Quasi-maximum likelihood estimation (QMLE) 
in linear exponential family, 509-514 
maximum likelihood estimation (MLE), 502-517 
general misspecification, 502—504 
generalized estimating equations for panel data, 
514-517 
in linear exponential family, 509-514 
model selection tests, 505-509 
partial, 514 
Poisson, 727—732, 745 
pooled, 514, 756-758 
random effects analysis, 760-761 
Quasi-time demeaning, 327, 328 


R-absolute loss function, 450 
Random coefficient models, 73-76 
correlated, 141-146 
described, 74-75 
example, 75 
Random effects (RE) 
analysis, 291-300, 495, 759-762 
estimation and inference, 291—297 
estimators in, 287, 294-297, 326-328, 552-554, 
866-868 
fixed effects analysis versus, 326-328, 349-358 
framework of, 286-287 
general feasible generalized least squares 
(FGLS) analysis, 298—299 
Hausman test comparing random and fixed 
effects estimators, 328—334, 355-358 
in cluster sampling, 867-871, 878-879, 882 
multiplicative random effects model, 759-760 
Poisson model, 760 
quasi-MLE random effects analysis, 760-761 
robust variance matrix estimator, 297—298 
testing for presence of unobserved effect, 
299-300, 552-554 
weighted multivariate nonlinear least squares 
estimator (WMNLS), 761-762 
logit model, 619-620 
probit estimator, 613—614 
probit model, 616-619, 663 
structure of, 294 
Tobit model, 709, 711-713 
2SLS estimator, 352—353 
Random effects instrumental variables (REIV) 
estimator, 349-353 
Random growth model, 375-377 
Random sampling, 5 
limit theorems, 41—42 
random sequence, 38-39 
Random trend model, 375-377 
Rank condition for identification, 92, 99-100, 
210-211, 248-251 
Rao’s score principle, 422—423 
Recursive system, in simultaneous equations 
models (SEMs), 259-260 
Reduced-form equation 
in simultaneous equations models (SEMs), 
243-245, 255-256 
nature of, 90-91, 243 
Redundancy, of proxy variables, 67—68 
Regressand (y). See Dependent variable (y) 
Regression. See Nonlinear regression model; 
Poisson regression; Seemingly unrelated 
regressions (SUR) model 
Regression adjustment, 915 
average treatment effect (ATE), 915-920 
ignorability of treatment, 915—920, 930—934 
Regression discontinuity designs, 954-959 
fuzzy regression discontinuity (FRD) design, 
957-959 
sharp regression discontinuity (SRD), 954-957 
unconfoundedness, 959 
Regressor (x). See Explanatory variable (x) 
Regularity conditions, 4—5 
Resampling methods, bootstrapping, 438—442 
RESET, 137-138, 482, 734 
Response probability, 561 
Response surface analysis, 437—438 
Response variable (y). See Dependent variable (y) 
Restricted model estimator, 62—63 
Right censoring, 785—786, 993 
Risk set, 1014 
Robust variance matrix estimator, 172, 297-298 
first-differencing methods, 318—319 
fixed effects serial correlation, 310-315 
Rubin causal model (RCM), 903 


Sample selection, 6, 790-845 
attrition, 837—845 
exponential response function, 814 
general attrition, 837—845 
incidental truncation, 802—814 
binary response model, 813-814 
endogenous explanatory variables, 809-813, 
817-819 
exogenous explanatory variables, 802—808, 
815-817 
Tobit selection equation, 815-821 
inverse probability weighting (IPW) for missing 
data, 821-827, 840-844 
linear models, 792-798 
linear panel data models, 827—845 
fixed effects estimation with unbalanced panels, 
828-831 
random effects estimation with unbalanced 
panels, 831-832 
sample selection bias, 832-837 
nonlinear models, 798—799 
structural Tobit model, 819—821 
truncated regression, 799—802 
Sampling. See Cluster sampling; Sample selection; 
Stratified sampling 
Sampling weights, 857 
Sargan-Hausman test, 135 
Score of the log likelihood, 476—478 
Score statistic, 62—65, 107, 299, 421—428 
Second-stage regression, two-stage least squares 
(2SLS) analysis, 97, 105 
Secondary sampling units, 894 
Seemingly unrelated regressions (SUR) model, 
systems of equations, 7, 185-191 
example, 161—163, 167—169, 186-189 
ordinary least squares versus feasible generalized 
least squares, 185—188 
singular variance matrices, 189-191 
systems with cross equation restrictions, 188—189 
Selected sample, 790 
Selection indicator, 793 
Selection mechanisms, 790 
Selection model, 692 
Selection on observables, 795-796, 909 
Self-selection problem, 289—290, 907—908 
Semielasticity, conditional expectations, 18 
Semiparametric estimators, 605, 632-634 
Semiparametric method, 688—689 
Semirobust variance matrix estimator, 415—416 
Sequential exogeneity 
estimation in unobserved effects model, 368—374 
systems of equations, 164—165 
Sequential moment restrictions, 368—371, 765-766 
Sequentially exogenous conditional on the 
unobserved effect, 368-371 
Sequentially exogenous covariates, 990 
Serial correlation, 172 
first-differencing methods, 319-320 
inference with fixed effects, 305—306 
robust variance matrix estimator, 310-315 
testing after pooled ordinary least squares 
(POLS) analysis, 198—199 
Series estimators, 915 
Sharp regression discontinuity (SRD) design, 
954-957 
Simultaneity, endogenous variable, 55 
Simultaneous equations models (SEMs), 207, 
239-273
autonomy, 239, 247
causality, 239—240 
examples, 240—241 
linear equations, 241—261 
covariance restrictions, 257—260 
cross equation restrictions, 256—257 
efficiency, 260—261 
estimation after identification, 252—256 
exclusion restrictions, 241—243 
general linear restrictions, 245—248 
identification, 251, 256-261 
order condition, 245, 249-251 
rank condition, 248—251 
reduced forms, 243-245, 255-256 
structural equations, 241-242, 245-251 
nonlinear equations, 262-271 
control function estimation for triangular 
systems, 268-271 
different instruments for different equations, 
271-273 
estimation, 266—268 
identification, 262—266 
scope, 239-241 
structural equations, 241—242, 245-251, 260-261 
Single-equation linear models 
control function approach to endogeneity, 
126-129, 145-146 
correlated random coefficient models, 141—146 
difference-in-differences estimation, 147-151
estimation with generated regressors and 
instruments, 123-126 
instrumental variables estimation, 89-114 
measurement error problem, 112—114 
motivation for, 89—98 
omitted variables problem, 112 
two-stage least squares (2SLS), 96-112 
ordinary least squares (OLS) analysis. See 
Ordinary least squares (OLS) analysis, Single- 
equation linear models 
overview, 53—54 
pooled cross sections over time, 146-147 
specification tests, 129-141 
endogeneity, 129-134 
functional form, 137—138 
heteroskedasticity, 138—141 
overidentifying restrictions, 134—137 
two-stage least squares (2SLS) analysis, 96-112 
asymptotic efficiency, 103—104 
asymptotic normality, 101—102 
consistency, 98—100, 108 
heteroskedasticity-robust inference, 106—107 
hypothesis testing, 104-106 
potential pitfalls, 107—112 
Single-spell data, 991-1010
Slutsky’s theorem, 39, 47 
Smearing estimate, 696 
Smith-Blundell procedure, 684—685 
Spatial correlation, 6 
Specification tests, 129-141 
endogeneity, 129-134, 594-599 
functional form, 137—138 
heteroskedasticity, 138—141 
in maximum likelihood estimation (MLE), 
482-483 
in ordered models, 658—659 
nonlinear regression model, 421—422 
overidentifying restrictions, 134—137 
Poisson quasi-maximum likelihood estimator 
(QMLE), 734-736 
Spillover effect, 8—9 
Stable unit treatment value assumption (SUTVA), 
905 
Standard censored regression model. See Type I 
Tobit model 
Standard error 
asymptotic, 44 
bootstrap, 438-439, 581, 590 
GLM standard error, 729—730 
heteroskedasticity-robust standard error, 61 
Huber, 61 
two-stage least squares (2SLS) analysis, 108—112 
White, 61 
Standard stratified sampling (SS sampling), 
854-856, 860 
Standardized residual, 418—419, 570-571 
State dependence, 371-374, 626 
Static models, 163—164 
Stochastic analysis, 4—11 
asymptotic analysis, 7 
data structures, 4—7 
cluster sampling, 6 
cross section data, 5-6 
experimental data, 5 
panel (longitudinal) data, 6-7 
random sampling assumption, 5 
spatial correlation, 6 
examples, 7—9 
setting selection, 4-5 
Stock sampling, 1000-1003 
Stratified sampling, 6, 853—863 
stratification based on exogenous variables, 
861-863 
variable probability sampling, 854-856 
weighted estimators to account for stratification, 
856-861 
Strict exogeneity, 165, 325-326, 329 
strictly exogenous conditional on the unobserved 
effect, 287-288, 495, 610-619
strictly exogenous corner solution responses,
707-713 
strictly exogenous covariates, 990 
Strong ignorability, 911 
Structural conditional expectation, 13 
Structural equations 
estimating Tobit equations with sample selection, 
819-821 
simultaneous equations models (SEMs), 241—242, 
245-251, 260-261 
Structural error, ordinary least squares (OLS) 
analysis, 66—67 
Structural model, ordinary least squares (OLS) 
analysis, 53 
Structural partial effects, 583-585 
Subject-specific (SS) model, 614—615 
Survival analysis, 983 
Survivor function, 985 
Symmetric test, 440 
System homoskedasticity assumption, 180—182 
System instrumental variables (SIV), 207 
System ordinary least squares (SOLS), 167—172 
OLS equation by equation, 169 
pooled ordinary least squares (POLS) estimator, 
169-172, 191-194, 198-199 
system ordinary least squares estimator of
β, 168-169
System ordinary least squares (SOLS) estimator of
β, 168-169
Systems of equations, 161—233 
examples, 161—166 
generalized least squares (GLS) analysis, 161, 
173-184, 200-201 
asymptotic normality, 175—176 
consistency, 173-175 
feasible GLS (FGLS), 176-188, 200-201 
instrumental variables estimation, 207—233 
examples, 207—210 
general linear system of equations, 210-213 
generalized instrumental variables (GIV) 
estimator, 207, 222-226 
generalized method of moments (GMM) 
estimation, 207, 213-222, 226-229, 232-233 
introduction, 207 
optimal instruments, 229-232 
simultaneous equations model (SEM), 207 
three-stage least squares (3SLS) analysis, 
219-222, 232 
two-stage least squares (2SLS) analysis, 
232-233 
introduction, 161 
linear panel data model, 163—167, 191-201 
assumptions for pooled ordinary least squares, 
191-194
contemporaneous exogeneity, 164-165 
dynamic completeness, 194-196 
example, 163—171 
sequential exogeneity, 164-165 
strict exogeneity, 165-166 
time series persistence, 196-197 
nonlinear equations, 532-538 
ordinary least squares (OLS) analysis, 161, 
166-173, 198-200 
preliminaries, 166—167 
testing multiple hypotheses, 172—173 
seemingly unrelated regressions (SUR) model, 
185-191 
example, 161—163, 167, 169, 186—188 
ordinary least squares versus generalized least 
squares analysis, 185—188 
singular variance matrices, 189-191 
systems with cross equation restrictions, 
188-189 


T statistics 
bootstrapping critical values for, 439-440 
heteroskedasticity-robust t statistics, 62
Test statistics, asymptotic properties of, 45—47 
Three-stage least squares (3SLS) analysis, 
219-222, 232 
comparison with generalized instrumental 
variables (GIV) estimator, 224-226, 254-255 
comparison with generalized method of moments 
(GMM) analysis, 345-347, 372-374 
estimation in nonlinear simultaneous equations 
models (SEMs), 266-268, 272-273 
in simultaneous equations, 254—255 
nonlinear 3SLS (N3SLS) estimator, 531-532 
traditional 3SLS estimator, 224-226 
two-stage least squares (2SLS) analysis versus, 
233-234 
Threshold parameters, 655 
Time-constant data, fixed effects methods, 
301-302, 328 
Time-demeaning matrix, 303—304, 880-882 
Time series persistence, 196—197 
Time-varying variances, 172 
coefficients in unobserved effects, 551—555 
Tobin’s method. See Tobit models 
Tobit models. See also Type I Tobit model; Type 
II Tobit model; Type III Tobit model 
cluster sampling, 875-876 
correlated random effects, 708, 711-713 
random effects, 709, 711-713 
Tobit selection equation, 817—819 
truncated Tobit model, 801, 815-821 
two-limit, 703-705, 787-788 
unobserved effects, 708
Top coding, 779-780
Traditional random effects probit model, 612—613 
Traditional 3SLS estimator, 224—226 
Treatment effects, average treatment effect (ATE), 
73 
Treatment group, difference-in-differences (DD) 
estimation, 147-148, 150 
Triangular systems 
control function estimation, 268-271 
nature of, 268—269 
Truncated normal hurdle (TNH) model, 692—695 
Truncated normal regression model, 801 
Truncated regression, 799-802 
Truncated Tobit model, 801, 815-821 
Two-limit Tobit model, 703—705, 787-788 
Two-part models, 691 
2SLS. See Two-stage least squares (2SLS) analysis 
Two-stage least squares (2SLS) analysis, 6, 96-112 
estimation in nonlinear simultaneous equations 
models (SEMs), 266-268, 271-273 
examples, 102, 105-106 
first-stage regression, 97, 105 
general treatment, 98—112 
asymptotic efficiency, 103—104 
asymptotic normality, 101—102 
consistency, 98—100 
heteroskedasticity-robust inference, 106—107 
hypothesis testing, 104-106 
potential pitfalls, 107-112 
linear models estimation by, 792—798 
nonlinear system 2SLS estimator, 531 
ordinary least squares (OLS) analysis compared 
with, 107-112 
second-stage regression, 97, 105 
in simultaneous equations models (SEMs), 
254-255, 258-259 
specification tests, 129-141 
endogeneity, 129—134 
functional form, 137—138 
heteroskedasticity, 138—141 
overidentifying restrictions, 134—137 
three-stage least squares (3SLS) analysis versus, 
233-234, 254-255 
two-stage least squares (2SLS) estimator, 96-98, 
100, 124-125 
with generated instruments, 124-125 
Two-stage least squares (2SLS) residuals, 101—102 
Two-step M-estimators, 409—413 
Two-step maximum likelihood estimator, 499 
Two-step partial MLE, 499 
Type I extreme value distribution, 646—647, 998 
Type I Tobit model, 670—689 
data censoring, 782, 787-789 
estimation, 676—677 
inference, 676-677
reporting results, 677-680
specification issues, 680—689 
testing against TNH model, 701-703 
useful expressions, 671—676 
Type II Tobit model, 690—703 
exponential, 697—703 
exponential conditional mean, 694-696 
incidental truncation, 804-806 
lognormal hurdle model, 694-696 
specification issues, 680—689 
truncated normal hurdle model, 692—694 
Type III Tobit model, 815-817


Uncentered R-squared, 63 
Unconditional hazard, 1005 
Unconditional information matrix equality 
(UIME), 479 
Unconditional maximum likelihood estimation 
(MLE), 541-542 
Unconditional variance matrix, 172, 182—183, 305, 
366 
Unconfoundedness, 908, 959 
assumption of, 822, 959 
Underdispersion, 512-513, 725
Unidentified equations, 251
Uniform convergence in probability, 402-403
Uniform weak law of large numbers (UWLLN),
403-404
Unobserved component, 285
Unobserved effects, 282
attrition in linear panel data models, 837 
Chamberlain approach, 347-349, 551, 616-619, 
624 
dynamic unobserved effects models, 625—630, 
713-715 
logit models under strict exogeneity, 619—625 
minimum distance approach to unobserved effects 
model, 549-551 
models of conditional expectations with, 
758-759 
probit models under strict exogeneity, 
610-619 
strictly exogenous conditional on the unobserved 
effect, 287—288 
strictly exogenous corner solution responses, 
707-713 
testing for presence of, 299-300 
time-varying coefficients on unobserved effects, 
551-555 
unobserved effects model (UEM), described, 
285-287. See also Linear unobserved effects 
panel data models 
Unobserved heterogeneity, 22, 285, 1003—1010, 
1017-1018
Unordered response, 643
Unstructured working correlation matrix, 448 


Variable addition test (VAT), 427—428, 573 
Variable probability sampling (VP sampling), 
854-856, 858-863 
Variables 
in conditional expectations, 13, 14, 23—24 
control, 3—4 
dependent (y). See Dependent variable (y) 
endogenous. See Endogenous variables 
exogenous, 54 
explanatory. See Explanatory variable (x) 
fixed explanatory, 9-11 
independent (x). See Explanatory variable (x) 
indicators of q, 112—114 
indicators of unobservables, 112—114 
interactions between unobservable and 
observable, 73-76 
omitted, 54-55 
in ordinary least squares (OLS) analysis, 65-76 
Variance 
asymptotic. See Asymptotic variance 
bootstrap estimate, 438 
conditional, 32—33, 172 
ordinary least squares (OLS) analysis 
heteroskedasticity, 60—62 
homoskedasticity, 59-60, 131 
Poisson variance assumption, 725 
robust variance matrix estimator, 172, 297—298 
Vuong model selection test, 505—509 


Wage offer function, 7-8 
Wald statistic, 46—47, 62, 104, 107, 136 
behavior under alternatives, 430—431 
for pooled OLS, 332-333 
generalized method of moments (GMM) under 
orthogonality condition, 532 
in bootstrap samples, 440 
Wald tests, 420-421, 423-424 
Weak consistency, 43 
Weak instruments, 108 
Weak law of large numbers (WLLN), 42, 
403-404, 526 
Weibull distribution, 987, 995—998, 1004, 1007, 
1018 
Weighted exogenous sample MLE (WESMLE), 
860-861 
Weighted least squares (WLS) analysis, 60—61 
in cluster sampling, 890-891 
Weighted M-estimator, in stratified sampling, 
856-861 
Weighted multivariate nonlinear least squares 
(WMNLS) estimator, 444-449, 614, 761-762
Weighted nonlinear least squares (WNLS)
estimator, 409-413 
adjustments, 418—420 
asymptotic normality, 411—413 
consistency, 410—411 
variable addition test (VAT) approach, 427—428 
White standard error, 61 
White test, 140 
Wild bootstrap, 441 
Willingness to pay (WTP), 780
With probability approaching one (w.p.a.1.),
39-40
Working correlation matrix, 448
Working variance matrix, 446

X variable. See Explanatory variable (x) 


Y variable, 13, 14 


Zero-conditional mean, 398 


